


2D'N2a 541 

DOCUMENT RESUME

SP 010 157

AUTHOR        Goulet, Larry R.; And Others
TITLE         Investigation of Methodological Problems in
              Educational Research: Longitudinal Methodology. Final
              Report.
INSTITUTION   Illinois Univ., Urbana.
SPONS AGENCY  National Inst. of Education (DHEW), Washington,
              D.C.
BUREAU NO     4-1114
PUB DATE      Sep 75
CONTRACT      NIE-C-74-0124
NOTE          261p.

EDRS PRICE    MF-$0.83 HC-$14.05 Plus Postage
DESCRIPTORS   *Behavior Change; *Educational Research; *Equated
              Scores; *Longitudinal Studies; *Measurement
              Techniques; Mental Development; Research Design;
              *Research Methodology; Standardized Tests; Testing;
              Test Interpretation; Time Factors (Learning); True
              Scores
IDENTIFIERS   *Anchor Test Study

ABSTRACT

The problems and issues involved in the conduct of
educational-developmental research are examined within the
perspective of longitudinal research methodology. Chapters 2 and 3
examine contemporary research designs and procedures implemented for
the selection of subjects and testing of behavior over time.
Particular attention is given to the sequential research paradigms
developed by Schaie for the purpose of simultaneous assessment of
age-, cohort-, and time-related behavior change. Some of the common
problems in measuring change and models for analyzing longitudinal
data are considered in Chapters 4 through 7. Particular emphasis is
given to interpretive problems resulting from the properties of
scales widely used with standardized achievement tests, to the
limitations of current techniques of vertical equating and
consideration of alternative equating methods, and to the evaluation
of the constancy of a construct over time. Chapter 8 presents an
exposition of time-series analysis along with a new procedure for
parameter estimation especially adapted to data from longitudinal
studies. In Chapter 9 the problems of measurement of true change are
reconsidered, and it is stated that lower and upper bounds for
estimated true change are derived under more relaxed conditions than
in classical test theory. The appendix reviews the report of the
Anchor Test Study. (Author/MM)



Documents acquired by ERIC include many informal unpublished materials not available from other sources. ERIC makes every
effort to obtain the best copy available. Nevertheless, items of marginal reproducibility are often encountered and this affects the
quality of the microfiche and hardcopy reproductions ERIC makes available via the ERIC Document Reproduction Service (EDRS).
EDRS is not responsible for the quality of the original document. Reproductions supplied by EDRS are the best that can be made from
the original.




PROJECT NO. 4-1114
CONTRACT NO. NIE-C-74-0124



INVESTIGATION OF METHODOLOGICAL PROBLEMS
IN EDUCATIONAL RESEARCH: LONGITUDINAL METHODOLOGY



LARRY R. GOULET
ROBERT L. LINN
MAURICE M. TATSUOKA



UNIVERSITY OF ILLINOIS AT
URBANA-CHAMPAIGN, ILLINOIS



SEPTEMBER, 1975



The research reported herein was performed pursuant to
a contract with the National Institute of Education,
Department of Health, Education, and Welfare. Contractors
undertaking such projects under Government
sponsorship are encouraged to express freely their
professional judgment in the conduct of the project.
Points of view or opinions stated do not, therefore,
necessarily represent official National Institute
of Education position or policy.






U.S. DEPARTMENT OF HEALTH,
EDUCATION, AND WELFARE

NATIONAL INSTITUTE OF EDUCATION



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION
ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY






FINAL REPORT 



PROJECT NO. 4-1114
CONTRACT NO. NIE-C-74-0124



INVESTIGATION OF METHODOLOGICAL PROBLEMS
IN EDUCATIONAL RESEARCH: LONGITUDINAL METHODOLOGY



LARRY R. GOULET
ROBERT L. LINN
MAURICE M. TATSUOKA



UNIVERSITY OF ILLINOIS AT
URBANA-CHAMPAIGN, ILLINOIS



SEPTEMBER, 1975 



U.S. DEPARTMENT OF HEALTH, 
EDUCATION, AND WELFARE 

NATIONAL INSTITUTE OF EDUCATION 






TABLE OF CONTENTS 



Acknowledgements

Chapter 1. Introduction

Chapter 2. The Study of Behavior Change Over Time: Overview

Chapter 3. General Sampling Strategies for B = f(T) Research

Chapter 4. The Determination of the Significance of Change Between
           Pre and Posttesting Periods

Chapter 5. Vertically Equated Test Forms

Chapter 6. Applications of the Simplex Model in Longitudinal Studies

Chapter 7. Constancy of Construct Validity Over Time

Chapter 8. Time-Series Analysis Applied to Longitudinal Studies

Chapter 9. Estimation of Time Change: Upper and Lower Bounds

Appendix A. Comparable Reading Test Scores



ACKNOWLEDGEMENTS

The research team that worked on this project consisted of the
three principal investigators, Larry R. Goulet, Robert L. Linn, and
Maurice M. Tatsuoka; a research associate, Kikumi K. Tatsuoka, who
joined the project in January 1975; and three research assistants,
Craig Barclay, Jeffrey A. Slinde, and Michael Townsend. For most of
the duration of the project, Patsy M. Rowland served as the project
secretary.

Each of the principal investigators took responsibility for
particular problem areas that were addressed in this project. The
areas of primary responsibility are reflected in the chapters of
this report. The writing responsibility for Chapters 2 through 9
and for Appendix A is as follows:

Chapter 2:   Larry R. Goulet, Craig Barclay, and Michael Townsend
Chapter 3:   Larry R. Goulet, Craig Barclay, and Michael Townsend
Chapter 4:   Robert L. Linn and Jeffrey A. Slinde
Chapter 5:   Robert L. Linn and Jeffrey A. Slinde
Chapter 6:   Robert L. Linn
Chapter 7:   Robert L. Linn
Chapter 8:   Maurice M. Tatsuoka
Chapter 9:   Kikumi K. Tatsuoka and Maurice M. Tatsuoka
Appendix A:  Robert L. Linn

Particular thanks go to Patsy M. Rowland for her care and speed
in typing several drafts, as well as part of the final report.
Chapters 8 and 9 were typed by Mrs. Joyce Sterner of Technitypists, Inc.,
Urbana, Illinois.



CHAPTER 1 
INTRODUCTION 



The basic premise upon which this report rests is that the
development and advancement of theory in education, the generation of data
and theory directly relevant to school programs and individual
classrooms, and the opportunity to examine complex educational questions
await the development of an appropriate methodology. Such a premise
is similar to that made by George Mandler (1967) in discussing
contemporary approaches to the experimental study of learning processes.
He suggested, for example, that contemporary research on human
learning emphasizes an "active" rather than a "passive" organism, and a
shift to the study of "complex" processes without the necessity of
conducting "complex experiments." The latter coup was attributed to
the development of and advances in our knowledge concerning research
methods.

Similar types of comments have been made by Fiske (1973) in
discussing the need for process-type research in the personality
area. He suggests, "the central but only vaguely recognized need is
for intensive work on the basic strategy of psychological research,
especially in the personality domain," and further asks, "can we
study the important psychological processes in the laboratory or
testing room? How can we be sure of the occurrence of the postulated
process? Or do we define each specific process simply as that which
we presume to occur between a particular stimulus and a designated
type of response." Fiske also suggests that laboratory research, in
addition to facing problems regarding the replicability of
process-type phenomena, faces an almost insurmountable problem: that of
determining the degree to which the findings are generalizable to
behavior in general.

Wohlwill (1973) has also addressed such questions from the
perspective of developmental psychology. In addressing the question
whether developmental research belongs in an "experimental" or
"differential" camp, he suggests, "it turns out that the study of
developmental change does not readily fit either of the two models, at least
in their simplest form. On the one hand, the study of age changes
in behavior differs, in certain important respects, from comparative,
differential investigations involving other interpersonal
characteristics, e.g., the study of sex differences. On the other hand, even
when development is subjected to direct experimental attack by
manipulating the conditions of experience in a controlled manner, the
situation still deviates in some critical ways from that which
confronts the experimentalist dealing with nondevelopmental problems.
Thus, the concern with development gives rise to very particular
requirements and considerations as regards experimental methodology,
research design, and scientific inference. To put it succinctly:
The canons of the scientific method, as they have been worked out for
the field of psychology at large, require modification when applied
to developmental problems." (pp. 16-17)

The comments of Mandler (1967), Fiske (1973), and Wohlwill (1973)
are equally appropriate to educational research, not only because of
the partial overlap of content across these disciplines, or the common
call for the development of new methods, but also because each has
called for the study of the respective phenomena in the environmental
contexts in which they occur and because each calls for the further
development of research methods which provide for direct, unconfounded,
and generalizable estimates of these processes as they change with time.

LONGITUDINAL AND CROSS-SECTIONAL METHODOLOGY

Some History of and the Interdisciplinary Character of Longitudinal Research

As Sontag (1971) has noted, longitudinal methodology is by no means
under the exclusive purview of developmental psychology. Its roots are
found in a variety of disciplines including demography and multiple
social sciences, life sciences, and physical sciences. Yet, he
suggests that the term longitudinal research evokes free associations
of a "womb-to-tomb" research plan, inadequate research design, inexact
measurement, and an inadequate and inordinately expensive research
product. Yet, somewhat paradoxically, the longitudinal method and the
superiority of longitudinal data over cross-sectional data remain
essentially unquestioned in educational and developmental research,
e.g., Hilton & Patrick (1970). Similarly, cross-sectional methodology
is seen primarily as a convenient but approximate substitute for
longitudinal measurement.




The qualms of scientists regarding the use of longitudinal designs
can be traced to a number of relevant problems: for example, the use
of a longitudinal design usually requires that the experimenter "age"
with his subjects; the experimenter cannot control the
subjects' experiences between the several times of testing; subject
attrition occurs; and, perhaps more important, the longitudinal
method commits the experimenter to a specific design and the use of
specific measurement instruments over the duration of the study.

Such difficulties were noted as early as 1741 by Sussmilch, who
also, by the way, commented on the problems of generalizability in
using what we now call cross-sectional methods.

Quetelet (1835) and Galton (1883) were advocates of the
cross-sectional method, yet it was not until the 1920's that the terms
longitudinal (Blatz & Bott, 1927) and cross-sectional (Gesell, 1925)
were used to designate the different methods, and it was Anderson
(1931), in his classic contribution to developmental methodology,
who affirmed their use as technical terms.



Considering the importance and use of longitudinal and
cross-sectional methodology in educational and developmental research, it
is unfortunate and surprising that comprehensive and satisfactory
discussions of the problem are unavailable in the educational literature.
As an example, it is in demography where significant advances have
been made (e.g., Whelpton, 1954). The lack of consideration of these
advances in other disciplines is particularly unfortunate since, as
one case in point, large-scale educational research related to student
development has borrowed conventional designs only from developmental
psychology rather than the likely more appropriate adaptations of these
designs used by the demographers.

"Experimental" and "Descriptive" Designs and Variables

Parallel types of criticisms have been directed to studies
utilizing longitudinal and cross-sectional sampling designs. The
primary criticism relates to the difficulty in assigning causality
or the directionality of relationships in such studies (Campbell &
Stanley, 1963; Russell, 1957; Spiker, 1966) and the inability to
subscribe fully to the principles of experimental design when these
procedures are used. As an example, chronological age is a biotic
variable not amenable to random assignment, replication, etc. Yet,
invoking the principle that only properly randomized experiments
can lead to useful estimates of causal treatment effects is a
potential trap for educational researchers. As examples, it may lead
educational researchers to reject one of the primary (if not the primary)
problems in the field, i.e., the estimation of the influences of
educational (e.g., classroom) experiences on performance; it can
lead to the design of educational research blindly following the
principles of experimental design at the expense of the crucial focus,
the critical analysis of educational environments and the attendant
individual-environment interactions. It also encourages "laboratory"
investigations rather than studies which take place in the
less-controlled educational context. And it encourages investigations
where data are collected at one time of measurement rather than
longer-term studies, with possible sacrifices in external validity
for gains in internal validity.

In addition, the costs in terms of time and money are indeed
prohibitive when "experiments" are conducted (Rubin, 1972). This
is true since it is impossible to perform equivalent experiments to
test all treatments on even a single educational question (e.g.,
examining 100 reading programs). And the above argument has not
included the point that the exclusive use of experimental variables
precludes the study of certain educational questions, or that random
assignment cannot be ethically used as a procedure in certain types of
studies.

Several of the questions and issues discussed above relate to
questions of research design and methodology and are addressed in
Chapters 2 and 3.

Specific Methodological Problems in Longitudinal Research

Longitudinal studies confront numerous difficulties, only a
fraction of which were addressed within the confines of this project.
A variety of issues involved in the measurement of change are considered
in Chapter 4. Of particular concern in Chapter 4 are difficulties
caused by characteristics of scales commonly used for standardized
achievement tests.

Studies, whether longitudinal or cross-sectional, which focus on
student achievement over a period of several years typically require
different measures of achievement at different grades or ages. In
order to make comparisons of achievement over time, such tests must
be put on a common scale, i.e., they must be vertically equated. In
Chapter 5 the adequacy of the vertical equating of some existing
standardized achievement tests is investigated and a study exploring
the potential utility of the Rasch model for the vertical equating
problem is reported.

Several attempts at using analytical techniques developed by
Joreskog for the analysis of covariance structures are discussed in
Chapters 6 and 7. In Chapter 6 the focus is on the fit of several sets of
data to a Simplex model, and in Chapter 7 the focus is on the use of
these techniques to evaluate the constancy of constructs over time.

Time-Series Analysis in Longitudinal Research

By its very name, time-series analysis seems to be a technique
especially suited to longitudinal research. A casual study of its
methodology reveals, however, that, as traditionally conducted, it
is applicable more to sequential cross-sectional research. In
Chapter 8 we first present an elementary exposition of time-series
analysis, then indicate the difficulties in applying it to data from
longitudinal studies as ordinarily conceived, and finally propose a
new method for estimation of parameters in time-series models that is
especially adapted to longitudinal data.

In brief, the main difficulties with the traditional procedures for
parameter estimation in time-series analysis are that (a) they require
a large number (>\0) of time-point observations, and (b) they ignore
the correlatedness of individual data across time. A procedure which
avoids these difficulties is proposed and successfully tested by means
of two numerical examples, one based on real data and the other using
simulated data.

Measurement of Change

The time-honored problem of measurement of time change is
revisited in Chapter 9. Difficulties with the traditional assumption
of "universally uncorrelated errors" are discussed in this context,
and a relaxed assumption of "homogeneity of error covariances" is
proposed. Under the latter assumption, lower and upper bounds for
estimated time change are derived, utilizing the mathematics of operator
analysis.

An example based on real data is presented, and it is shown that
the uncorrelated-errors assumption leads to an absurd result (a
multiple-R greater than unity), while the relaxed condition yields
reasonable and useful bounds.



REFERENCES

Anderson, J. E. The methods of child psychology. In C. Murchison
     (Ed.), A handbook of child psychology. Worcester: Clark
     University Press, 1931.

Blatz, W. E., & Bott, E. A. Studies in mental hygiene: I. Behavior
     of public school children - description of method. Pedagogical
     Seminary, 1927, 34, 552-582.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental
     designs for research on teaching. In N. L. Gage (Ed.),
     Handbook of research on teaching. Chicago: Rand McNally,
     1963, 171-246.

Fiske, D. W. Research on psychological processes with particular
     reference to personality. In S. B. Sells (Ed.), Needed research
     on psychological processes. Washington, D.C.: U.S. Office
     of Education, 1973.

Galton, F. Inquiries into human faculty and its development. London:
     Macmillan, 1883.

Gesell, A. L. The mental growth of the pre-school child: A
     psychological outline of normal development from birth to the sixth
     year, including a system of developmental diagnosis. New York:
     Macmillan, 1925.

Hilton, T. L., & Patrick, C. Cross-sectional versus longitudinal
     data: An empirical comparison of mean differences in academic
     growth. Journal of Educational Measurement, 1970, 7, 15-24.

Mandler, G. Verbal learning. In G. Mandler, P. Mussen, N. Kogan,
     and M. A. Wallach (Eds.), New directions in psychology: III.
     New York: Holt, Rinehart and Winston, 1967.

Quetelet, A. L. Sur l'homme et le développement de ses facultés.
     Paris: Bachelier, 1835.

Rubin, D. Estimating causal effects of treatments in experimental
     and observational studies. Princeton, N.J.: Educational
     Testing Service, Research Bulletin 72-39, 1972.

Russell, A. An experimental psychology of development: Pipe
     dream or possibility. In D. B. Harris (Ed.), The concept of
     development. Minneapolis: University of Minnesota Press,
     1957, 162-174.

Sontag, L. W. The history of longitudinal research: Implications
     for the future. Child Development, 1971, 42, 987-1002.

Spiker, C. C. The concept of development: Relevant and irrelevant
     issues. In Stevenson, H. W., The concept of development.
     Monographs of the Society for Research in Child Development,
     1966, 31, No. 2 (Serial No. 107), 40-54.

Whelpton, P. K. Cohort fertility (native white women in the United
     States). Princeton: Princeton University Press, 1954.

Wohlwill, J. F. The study of behavioral development. New York:
     Academic Press, 1973.




CHAPTER 2

THE STUDY OF BEHAVIOR CHANGE OVER TIME
OVERVIEW



The study of time-related behavior change comes in varied forms.
To the developmental psychologist, such a research focus most typically
implies the study of behavioral development. For the sociologist, such
a purpose more likely would imply the study of social or sociocultural
change. The educational researcher is concerned with both of these in
a very direct way. We are concerned with how the population of school
children changes across time, e.g., years or decades, and with the
performance changes of specific groups of children as they pass through
successive school grades. The first two purposes notwithstanding,
the educational researcher is often confronted with a third and more
specific question, i.e., the assessment of the influences of schooling
or educational intervention. The differences, similarities and
interrelationships among these various research questions are discussed
in detail in various sections of Chapters 2 and 3. It is our intention
to examine various research designs and theoretical models which fit
such questions. Several theoretical assumptions which underlie these
questions are also examined, most specifically as they apply to
longitudinal methodology. These questions and designs are discussed in
this chapter in the context of conventional procedures and sampling
methods. Modifications and extensions of such designs proposed by
Bell (1953), Schaie (1965) and Baltes (1968) are presented and
discussed. In Chapter 3, general sampling procedures are presented
which can be adapted to the theoretical model and assumptions adopted
by the researcher.

Many of the inferences of this paper rest on the assumption that
educational research, like developmental research, can be described
by problems which take the form:

     B = f(T)

where "B" refers to the behavior or behavior changes to be studied,
and "T" refers to the time period over which the assessments are made
(Baltes and Goulet, 1971). As will be shown, most designs used in
educational research can be described by the above paradigm even though
they represent only the simplest case of a more general model for
research concerned with changes in behavior associated with time.
These research designs are discussed and their limitations in the
context of educational research are noted in the next section.






SIMPLE DESIGNS FOR EDUCATIONAL RESEARCH



Schaie (1965) has noted that the paradigm B = f(T) described
above spawns three alternate research designs, generally known as
the cross-sectional method, the longitudinal method, and the
time-lag method. These three designs differ in terms of the procedures
used to draw the samples of interest and the time period over which
measurements are taken. With the cross-sectional design, for
example, samples of different ages are tested at the same point in
time. As will be shown, such a design has limited usefulness in
educational research. The longitudinal method requires the testing of
samples with the same birthdate (or alternately samples who are in
the same school grade) at different points in time. Such a design
is perhaps the most popular of the three in the context of educational
research since the children can be followed over periods of time when
they are enrolled in school.

It is important at this point to mention that the longitudinal
design is amenable to both between-S and within-S (i.e., repeated
measurement) testing procedures. As mentioned above, the basic
requirements of the longitudinal design are met if Ss with the same
birthdate are tested at two or more points in time. This may be
accomplished through the repeated testing of the same sample of Ss,
i.e., a within-S longitudinal design. With a between-S longitudinal
design, samples of Ss can be randomly drawn from a population born
within the same period, with each sample being assigned to testing
at one of the times of measurement represented in the investigation.
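The distinction between the within-S and between-S longitudinal designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the integer subject identifiers, and the choice to draw the between-S samples without overlap are all hypothetical and are not taken from the report.

```python
import random

def within_s(population, times, n, seed=0):
    """Within-S design: one sample from the cohort, retested at every time."""
    rng = random.Random(seed)
    sample = rng.sample(population, n)
    return {t: list(sample) for t in times}

def between_s(population, times, n, seed=0):
    """Between-S design: a separate random sample from the same cohort
    is assigned to each time of measurement (assumed non-overlapping here)."""
    rng = random.Random(seed)
    drawn = rng.sample(population, n * len(times))
    return {t: drawn[i * n:(i + 1) * n] for i, t in enumerate(times)}

# Hypothetical cohort of 100 subject IDs, all born within the same period.
cohort = list(range(100))
times = [1974, 1975, 1976]

repeated = within_s(cohort, times, n=10)
independent = between_s(cohort, times, n=10)
```

In the within-S case every time of measurement holds the same ten Ss, so repeated-testing effects and attrition can arise; in the between-S case each S is tested once.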

The time-lag design, the least used in educational research,
yet perhaps the most powerful of the three designs for educational
purposes, requires the testing of samples with different birthdates
at the same chronological age. This, of course, requires testing the
samples in the order in which they are born.

These three designs are represented in Figure 1, with the
cross-sectional (Xs) design conforming to the vertical (cross-row)
comparisons, the longitudinal (Lo) design conforming to the horizontal
(cross-column) comparisons, and the time-lag (Tl) design conforming
to the diagonal comparisons. As Figure 1 also illustrates, a
particular sample of Ss is fully described by three components: date of
birth (cohort), age (A), and time of testing (Schaie, 1965). Note,
however, that the sampling model described in Figure 1 makes no
reference to the level of educational attainment (e.g., school grade)
of the respective samples of subjects defined by the model. It is
apparent that any prototypic design for educational research must
provide for the estimation of such a parameter and this is discussed
in later sections of the chapter. However, at this point it is most
relevant to contrast the three alternate designs as they incorporate
this parameter into one of the three already described.
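As a rough illustration of how the three designs select cells from a cohort-by-time grid, the following Python sketch builds such a grid and picks out the vertical, horizontal, and diagonal comparisons. The birth years, testing years, and function names are hypothetical values chosen for the example; only the relation age = time of testing minus date of birth is implied by the text.

```python
from itertools import product

cohorts = [1962, 1964, 1966]   # hypothetical years of birth (rows)
times = [1972, 1974, 1976]     # hypothetical years of testing (columns)

# Any two of cohort, age, and time of testing determine the third.
cells = [{"cohort": c, "time": t, "age": t - c}
         for c, t in product(cohorts, times)]

def cross_sectional(cells, time):
    """Vertical comparison: different cohorts (ages) at one testing time."""
    return [x for x in cells if x["time"] == time]

def longitudinal(cells, cohort):
    """Horizontal comparison: one cohort at successive testing times."""
    return [x for x in cells if x["cohort"] == cohort]

def time_lag(cells, age):
    """Diagonal comparison: different cohorts tested at the same age."""
    return [x for x in cells if x["age"] == age]

print(cross_sectional(cells, 1974))  # ages 12, 10, 8 tested in 1974
print(longitudinal(cells, 1964))     # cohort 1964 at ages 8, 10, 12
print(time_lag(cells, 10))           # age 10 reached in 1972, 1974, 1976
```

Each selection holds one component constant and lets the other two vary together, which is the source of the confounding discussed in the sections that follow.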



[Figure 1. Simple Designs for Educational Research. Rows: Cohort;
columns: Time of Measurement; the cross-sectional (Xs), longitudinal
(Lo), and time-lag (Tl) comparisons are marked.]

Educational Attainment and the Cross-Sectional Design

Studies concerned with educational phenomena and utilizing the
cross-sectional sampling procedures implicitly or explicitly incorporate
educational experiences as part of the age component. Examples are
studies where the samples of Ss tested differ in chronological age (CA)
by a minimum of one year or a minimum of one school grade. As is
apparent, such a procedure yields results which confound amount of
schooling and other components of CA-related behavior change and,
thus, the effects of educational experiences can be estimated only in
conjunction with these other factors. Furthermore, the cross-sectional
method requires the added assumption that the effects of schooling for
children in comparable grades are the same irrespective of the year in
which the children are enrolled. This assumption is similar to that made in
developmental research, i.e., that measures of performance utilizing
cross-sectional sampling procedures will provide results identical to
those involving longitudinal sampling procedures (Wohlwill, 1970).

Similarly, within-grade cross-sectional contrasts (where between-CA
contrasts are made for Ss in the same grade) have little use in
educational research since this design does not provide for variation
in the educational experiences of the samples.

Educational Attainment and the Longitudinal Design

The longitudinal design suffers from the same limitations as the
cross-sectional method, except that the limitation holds when both
within- and between-grade contrasts are made. Again, amount of school
experience and other CA-related influences on behavioral development
are inextricably correlated. In fact, the case has been made (Goulet,
Williams, & Hay, 1974) that, because of the confounding of CA-related
and school-related influences on development, the longitudinal method
will normally provide estimates of behavior change which exceed those
involving the cross-sectional method when within-grade contrasts are
made.

Educational Attainment and the Time-Lag Method

In contrast to the cross-sectional and longitudinal methods, the
use of the time-lag method perhaps more properly identifies school
experience with the time-of-testing component in Figure 1. The use
of this design in educational research, although somewhat limited by
the age-graded nature of the schools, nevertheless permits both
within-grade and between-grade contrasts to be made for samples of varying
CAs. The design capitalizes on two simple facts, i.e., that children
within a grade differ in CA, and that the CA of a sample of Ss increases
over the period of a school year. Thus, in reference to Figure 1, if
testing takes place in October and April within the same academic
year, it is possible to contrast matched-CA samples within a grade
(e.g., at age A) or between matched-CA samples in adjacent school
grades (e.g., at age A). Such contrasts permit the estimation of
the effects of school experiences independently of other CA-related
factors. A major limitation of using the time-lag method is that
the contrasts may only be made for Ss in adjacent grades or for
within-grade contrasts. Nevertheless, many such contrasts can be made.
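The two facts the time-lag design relies on (a spread of CA within a grade, and growth in CA over the school year) can be made concrete with a small sketch. All numbers here are assumptions made for illustration: a twelve-month spread of CA within each grade, first graders aged 72 to 83 months in October, and testing six months apart. None of these values comes from the report.

```python
def ca_range(grade, months_after_october, entry_age_months=72):
    """Set of chronological ages (in months) present in a grade at testing.

    Hypothetical assumption: grade 1 spans 72-83 months in October and
    each later grade is exactly 12 months older.
    """
    low = entry_age_months + 12 * (grade - 1) + months_after_october
    return set(range(low, low + 12))

OCTOBER, APRIL = 0, 6

# Same CA, same grade, six months apart in schooling (within-grade contrast).
within_grade = ca_range(3, OCTOBER) & ca_range(3, APRIL)

# Same CA, adjacent grades (between-grade contrast).
between_grade = ca_range(3, APRIL) & ca_range(4, OCTOBER)

print(sorted(within_grade))   # CAs 102-107 months occur at both testings of grade 3
print(sorted(between_grade))  # CAs 108-113 months occur in grade 3 (April) and grade 4 (October)
```

Under these assumptions, Ss of identical CA differ only in the amount of schooling received at the time of testing, which is the contrast the paragraph above describes.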

It is apparent that the use of a cross-sectional sampling strategy
is inappropriate when the purpose of the researcher is to assess
education-related performance changes associated with time. The
difficulty is further compounded when it is taken into consideration that
cross-sectional differences in performance are as likely attributable
to population (i.e., cohort) differences as to age differences. Bell
(1953) and Kessen (1960) have each noted this possibility and have
advocated the use of longitudinal sampling whenever population
differences/changes are a possibility. However, longitudinal sampling, where
Ss are repeatedly tested, suffers from potential contamination due
to repeated observation, attrition, etc. Longitudinal measurement
also "takes time" since the researcher must wait between successive
testing periods. In addition, it is evident in Figure 1 that longitudinal
changes in performance may be attributable to factors associated with
age, time-of-testing, or both.
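The confounding described above can be shown numerically with a toy model. The sketch below assumes, purely for illustration, that behavior is an additive linear function of age, cohort, and time of testing; the model, its coefficients, and the function name are hypothetical and are not proposed in the report.

```python
def behavior(cohort, time, age_eff=1.0, cohort_eff=0.5, time_eff=0.25):
    """Hypothetical additive model of B as a function of age, cohort, and time."""
    age = time - cohort
    return age_eff * age + cohort_eff * cohort + time_eff * time

# A "pure" two-year age effect under this model would be 2 * age_eff = 2.0.

# Cross-sectional: ages 10 vs. 8, both tested in 1974 (cohorts 1964 vs. 1966).
cross_sectional_diff = behavior(1964, 1974) - behavior(1966, 1974)  # 1.0

# Longitudinal: cohort 1964 tested at age 12 (1976) vs. age 10 (1974).
longitudinal_diff = behavior(1964, 1976) - behavior(1964, 1974)     # 2.5

# Time-lag: age 10 in both cases, cohort 1966 in 1976 vs. cohort 1964 in 1974.
time_lag_diff = behavior(1966, 1976) - behavior(1964, 1974)         # 1.5

print(cross_sectional_diff, longitudinal_diff, time_lag_diff)
```

None of the three contrasts returns the two-year age effect of 2.0: the cross-sectional difference mixes age with cohort, the longitudinal difference mixes age with time of testing, and the time-lag difference mixes cohort with time of testing.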

Bell's Convergence Method

Such difficulties in interpretation of the B = f(T) functions have
led to several suggested modifications of the above sampling procedures.
The first of these was presented by Bell (1953) and called the
Convergence Method. A prototype of the Convergence Method is presented
in Figure 2.

Figure 2 describes four samples of children (cohorts 1962, 1964,
1966, 1968), each tested in three consecutive years (1974, 1975, 1976),
and involves combining the longitudinal and cross-sectional sampling
methods in such a way that "developmental changes for a long period
may be estimated in a much shorter period" (Bell, 1953, p. 147). In
other words, the age function from 6-14 in Figure 2 can be described
by using three testing points (spanning a two-year period) for each
of the four cohorts. The overlap in CA for the successive cohorts
(e.g., cohorts 1968 and 1966 are each tested at the age of eight) is
built into the design in such a fashion as to permit the possibility
of assessing population differences. In other words, in the absence of
performance differences across different cohorts matched on CA,
Bell (1953) suggested that the longitudinal function estimated using
the convergence method would overlap with the longitudinal function
which would have been obtained if the 1962 cohort had been
tested at the age of six and yearly thereafter.
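
The layout of the Convergence Method can be sketched in a few lines of Python. This is only an illustration of the design in Figure 2, not a procedure given by Bell (1953): the function names are hypothetical, and age is assumed to be the whole-year difference between year of testing and year of birth.

```python
# Illustrative sketch; helper names are hypothetical, and age is assumed
# to equal (year of testing - birth year of the cohort) in whole years.

def convergence_design(cohorts, test_years):
    """Return {(cohort, year): age} for every cohort tested in every year."""
    return {(c, t): t - c for c in cohorts for t in test_years}

def overlapping_ages(design):
    """Map each age reached by more than one cohort to those cohorts."""
    by_age = {}
    for (cohort, _year), age in design.items():
        by_age.setdefault(age, set()).add(cohort)
    return {age: sorted(cs) for age, cs in by_age.items() if len(cs) > 1}

# Cohorts and testing years as shown in Figure 2.
design = convergence_design([1962, 1964, 1966, 1968], [1974, 1975, 1976])
youngest, oldest = min(design.values()), max(design.values())
shared = overlapping_ages(design)
print(youngest, oldest)   # span of ages covered by the three testing years
print(shared)             # ages at which two cohorts can be compared
```

The ages shared by adjacent cohorts are the points at which cohort differences can be checked before the separate short segments are joined into a single age function.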

Bell's (1953) Convergence Method was suggested as an alternate
sampling procedure (replacing longitudinal or cross-sectional methods)
to reduce some of the difficulties associated with longitudinal
sampling. Implicit in suggesting the method was the suggestion that
longitudinal sampling was clearly the method of choice when the
purpose of the researcher is to describe developmental-age functions
for a specific cohort or population of subjects.

Furthermore, Bell clearly anticipated recent refinements in
longitudinal methodology by suggesting that combinations of longitudinal
and cross-sectional sampling have merits which clearly exceed those
of either sampling method used alone. His suggestions have been
tacitly accepted by Schaie (1965), Baltes (1968), Buss (1973), and Goulet,
Hay, & Barclay (1974) in recent papers which have had the primary
purpose of identifying the components of time-related behavior change.

SEQUENTIAL METHODOLOGY 

Schaie (1965) has criticized the available sampling methods and has
suggested that longitudinal and cross-sectional methods are only special
cases of a general model for research on behavior change over time.
He argued that performance is a function of three factors: the age (CA)
of the organism, the cohort (C) to which the organism belongs, and
the time (T) at which measurement occurs, i.e., R = f (A, C, T). A
cohort, according to Schaie (1965), refers to the population of organisms
born at the same point or interval in time. In short, Schaie (1965)
suggested that differences associated with age which are obtained using
longitudinal and cross-sectional sampling procedures would accurately
reflect behavioral development (and provide identical estimates of
age-related behavior change) only if there are no population (i.e.,
generation) or environmental (culture) changes over time. In the absence
of evidence to the contrary, cross-sectional differences in performance
must be assumed to reflect the combined influences of developmental
(i.e., age) and population (i.e., cohort) changes associated with time.
Similarly, longitudinal differences in performance reflect influences
of age- and time-of-measurement-related factors.

In view of the potential confounding, Schaie proposed a model for
the conduct of developmental research which provides the opportunity to
examine the influences of each of these components on performance. The
general model generates three different sequential research designs
which permit CA, cohort, and time of measurement to be simultaneously
varied, two at a time. The general model is summarized in Figure 3.

As Figure 3 indicates, samples of Ss representing five levels of
age and nine cohorts are tested at five times of measurement. Between-
row contrasts represent conventional cross-sectional (x-s) comparisons.
Diagonal contrasts conform to a time-lag (Tl) design, and between-
column comparisons represent longitudinal (Lo) contrasts. As is
apparent, cross-sectional comparisons confound age and cohort
differences, longitudinal comparisons confound age and time-of-measurement
differences, and time-lag comparisons confound cohort and time-of-
measurement differences. In view of such confounding, Schaie (1965)
suggested the use of sequential sampling designs which separate sources
of variance associated with the three components. Thus, a cohort-



                     Time of Testing

Cohort          1974      1975      1976

1968              6         7         8
1966              8         9        10
1964             10        11        12
1962             12        13        14

Cell entries refer to CA at the time of testing.

Figure 2
Bell's Convergence Method



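                         Time of Testing

Cohort      1971     1972     1973     1974     1975

1962          9
1963          8        9
1964          7        8        9
1965          6        7        8        9
1966          5        6        7        8        9
1967                   5        6        7        8
1968                            5        6        7
1969                                     5        6
1970                                              5

Cell entries are ages at the time of testing. Contrasts within a
column are cross-sectional (x-s), contrasts within a row are
longitudinal (Lo), and contrasts along a diagonal of equal age are
time-lag (Tl).

Figure 3

A Prototype of Schaie's General Developmental Model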



sequential design, represented by samples b, c, e, and f in Figure 3,
provides an estimate of age differences controlled for cohort
differences and of cohort differences controlled for age. Similarly,
a time-sequential design, represented by samples a, b, c, and e in
Figure 3, provides an estimate of age differences with time of
measurement controlled, and of time-of-measurement differences with age
controlled. The cross-sequential design, represented by samples b,
c, d, and e in Figure 3, provides estimates of cohort changes
unconfounded by time and of time differences unconfounded by cohort
differences. Schaie (1965) suggests further that a sampling plan
conforming to the example provided in Figure 3 provides the opportunity to
assess the independent effects of each of the three components with
a minimum of six samples of Ss, e.g., samples a, b ... g, in Figure 3.

The primary utility of Schaie's model is that it provides methods
for separating sources of developmental change. That is, unlike the
cross-sectional method, the use of the cohort-sequential design
provides the opportunity to examine age differences in the absence of
confounding with the cohort variable. Similarly, the time-sequential
design provides the possibility of identifying age-related effects
without the confounding of time of measurement (as with the
longitudinal method).
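
The confounding pattern described above can be made concrete with a short Python sketch. It is an illustration only and is not part of Schaie's (1965) presentation: the function name is hypothetical, each sample is identified by an assumed (cohort, time of testing) pair, and age is taken as time minus cohort.

```python
# Illustrative sketch; classify_contrast is a hypothetical helper.
# A sample is a (cohort, time_of_testing) pair; age = time - cohort.

def classify_contrast(sample_1, sample_2):
    """Name the conventional design matching a two-sample contrast and
    list the components that differ (and are therefore confounded)."""
    (c1, t1), (c2, t2) = sample_1, sample_2
    if (c1, t1) == (c2, t2):
        raise ValueError("the two samples are identical")
    if t1 == t2:                      # same time, different cohorts and ages
        return "cross-sectional", ("age", "cohort")
    if c1 == c2:                      # same cohort, different times and ages
        return "longitudinal", ("age", "time")
    if t1 - c1 == t2 - c2:            # same age, different cohorts and times
        return "time-lag", ("cohort", "time")
    return "none of the three", ("age", "cohort", "time")

# Cells taken from the layout of Figure 3.
print(classify_contrast((1965, 1971), (1966, 1971)))
print(classify_contrast((1965, 1971), (1965, 1972)))
print(classify_contrast((1965, 1971), (1966, 1972)))
```

Each conventional design holds exactly one component constant, which is why two components always vary together in any single two-sample contrast.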

Nevertheless, the model as discussed to this point remains
exclusively descriptive, and no theoretical meaning can be ascribed to either
age, cohort, or time-of-testing effects obtained when using the model.
Schaie (1965) has, in fact, suggested that the three components are
subject to theoretical interpretation; that is, age differences
estimated from the model may, according to Schaie, be interpreted as the
"net effect of maturational change," time differences as "net changes
within the environment," and cohort effects as "net changes between
generations" (1965, p. 9^). Schaie suggests further that these effects
may be estimated simultaneously whenever data are available which
conform to the general model; e.g., the samples (a - g) in Figure 3.

The theoretical interpretations of age, time, and cohort effects
proposed by Schaie have evoked considerable controversy (e.g., Baltes,
1968; Buss, 1973; Wohlwill, 1973). Baltes (1968), for example, has
suggested that the three components, Age (A), Time (T), and Cohort (C),
do not exist independently of one another; i.e., that Schaie's model
can be described adequately by two rather than three components. In
other words, once two of the components are specified, the third is
unequivocally fixed. This fact can be demonstrated by recourse to a
simple example; i.e., that the cohort for any particular sample of
Ss may be determined by subtraction of age (years) from the time of
measurement; i.e.,

          C = T - A                                          2.1

Similarly, it can be shown that the following two relationships exist:

          T = A + C                                          2.2
          A = T - C                                          2.3



Baltes (1968) suggested that the existence of the mutual
dependencies reduces the model to a bifactor rather than a trifactor model and
that one of the components in Schaie's (1965) formula, R = f (A, C, T),
can be replaced by substitution. As an example, with the substitution of
A + C for T (formula 2.2), Schaie's formula becomes R = f (A, C, A + C).
Further difficulties relating to the theoretical interpretation of
B = f (T) phenomena are discussed later in this chapter.
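
The dependency among the three components can be checked numerically. The sketch below is an illustration added here, not an analysis from Baltes (1968): it builds the cells of Figure 3, confirms that T = A + C in every cell, and uses a small hypothetical rank routine (exact arithmetic, standard library only) to show that adding T to a linear design containing A and C adds no independent column.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by Gauss-Jordan elimination
    carried out in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r

# Cells of Figure 3 as (age, cohort, time): cohorts 1962-1970,
# times 1971-1975, ages 5-9.
cells = [(t - c, c, t)
         for c in range(1962, 1971)
         for t in range(1971, 1976)
         if 5 <= t - c <= 9]

rank_act = rank([[1, a, c, t] for a, c, t in cells])  # intercept, A, C, T
rank_ac = rank([[1, a, c] for a, c, t in cells])      # intercept, A, C
print(len(cells), rank_act, rank_ac)
```

Because the two ranks are equal, a linear model cannot assign separate coefficients to all three components without an added constraint, which is the sense in which the model reduces to two factors.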

It is, however, important to consider the implications of Baltes'
suggestions as they relate to research methodology and the adequacy of
available sampling methods. First, in the absence of the possibility
of functionally separating age from time-of-measurement effects, or
cohort from time-of-measurement effects, the longitudinal and time-
lag sampling methods immediately become (contrary to Schaie's
suggestions) acceptable research designs for the study of B = f (T)
phenomena. These two designs are limited only in their
generalizability; i.e., longitudinal data collected on a single cohort provide
"true" estimates of age-related development for the cohort and time
interval being studied. Similarly, performance differences estimated
using the time-lag method provide true estimates of cohort-related
change for the ages and time interval being studied. Only the cross-
sectional sampling method is unacceptable since it confounds age and
cohort effects.

The problem of generalizability is also reduced, according to
Baltes (1968), if the longitudinal method is supplemented by: (1)
obtaining longitudinal measurements for more than one cohort; i.e.,
by using the cohort-sequential design (or what Baltes calls
longitudinal sequences); or (2) obtaining cross-sectional measurements across
several times; i.e., using Schaie's time-sequential design (or what
Baltes calls cross-sectional sequences). Most important, both Schaie and
Baltes recommend the use of sequential designs for the study of B = f (T)
phenomena, and their use is most strongly recommended here whenever the
intent of the researcher is to obtain acceptable (and generalizable)
estimates of age or generation effects. It is apparent, however, that
the estimates of B = f (T) phenomena using sequential methods remain
descriptive and subject to differing theoretical interpretations. The
Baltes (1968) and Schaie (1965) controversy is a case in point.

SEQUENTIAL DESIGNS AND EDUCATIONAL RESEARCH

The above discussion has highlighted the difficulties in using
conventional sampling methods in research oriented to the assessment of the
influences of educational experiences. This discussion leads to three
questions concerned with these problems.



1. Do children of varying CAs enter a school grade with
   varying proficiency?

2. What are the non-CA-related influences of schooling?

3. What is the nature of the interaction between amount of
   schooling and CA in performance?

Unfortunately, none of the designs previously discussed
provides information concerning these questions. Nevertheless, it is
possible to set up a sampling procedure which, when used, permits these
questions to be addressed directly. Figure 4 provides a prototype of
such a sampling plan. In the figure, samples of Ss varying in CA
(A1, A2, ...), amount of schooling (S1, S2, ...), and school grade
are tested at different points in time during the period of a school
year. The plan permits cross-sectional contrasts (between-row comparisons),
longitudinal contrasts (diagonal comparisons), and time-lag contrasts
(between-column comparisons).

The cross-sectional contrasts (relevant to question 1 above)
provide comparisons of performance for samples of children varying in CA
but who have had the same amount of formal schooling. For the time-
lag contrasts (relevant to question 2 above), the comparisons are
for samples matched on CA who vary in amount of schooling. The
longitudinal contrasts (where samples of Ss born during the same period
are tested at different points in the school year) inextricably
confound CA and amount of schooling. Fortunately, the cross-linking of
appropriate samples (as exemplified in Figure 4) permits comparisons
which provide information regarding each of the above
three questions in the same analysis. For example, statistical
contrasts involving samples a, b, c, and d in Figure 4 permit the
behavior changes related to the first four months of schooling, CA,
and their interaction to be estimated for children in first grade.
An analysis involving samples d, e, f, and g from Figure 4 permits
similar comparisons for the last four months of the school year.
Finally, an analysis involving samples g, h, i, and j from Figure 4
permits educational growth during the latter part of first grade
and the early part of second grade to be estimated. Each of the
statistical analyses outlined above represents a simple 2 x 2 factorial
design with CA and time of testing as the two factors. Furthermore,
each analysis permits two independent assessments of the influence
of schooling (one at each of two levels of CA) and two estimates of
the relation between CA and performance (one at each of two times of
testing in the school year). Additional discussion of the statistical
analyses which follow from the use of the sampling plan in Figure 4 is
presented later in this paper. However, it is important at this point



[Figure 4: grid of independent samples. Rows: Chronological Age
(6-2, 6-6, 6-10, 7-2, 7-6, 7-10, 8-2, 8-6). Columns: Time of Testing
(Sept., Jan., May, Sept., Jan.) spanning Grade 1 and Grade 2. Cell
entries (CA level, amount of schooling, and sample label) are not
legible in this copy.]

Figure 4

Sequential Sampling Procedure for Educational Research



to note that the samples of Ss represented in the present model are
independent groups. Thus all comparisons conforming to cross-
sectional, time-lag, or longitudinal designs are based on between-S
(as opposed to within-S) comparisons. Such contrasts may be made
across the entire period of formal schooling and, interestingly, data
conforming to the sampling plan in Figure 4 and spanning several school
grades may be collected over the period of a single school year (e.g.,
1975) or multiple school years, e.g., 1975, 1976, ...
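
The 2 x 2 analyses described above can be sketched as simple contrasts among four independent cell means. The Python below is an illustration under stated assumptions, not the statistical analysis presented later in the report: the function name and the cell means are invented, and the two levels of CA and of time of testing are coded 0 and 1.

```python
# Illustrative sketch; factorial_2x2 and the cell means are hypothetical.
# means[(ca_level, time)] is the mean score of one independent sample,
# with ca_level and time each coded 0 (lower/earlier) or 1 (higher/later).

def factorial_2x2(means):
    # Influence of schooling: later minus earlier testing, at each CA level.
    schooling = [means[(ca, 1)] - means[(ca, 0)] for ca in (0, 1)]
    # Relation of CA to performance: older minus younger, at each time.
    ca_effect = [means[(1, t)] - means[(0, t)] for t in (0, 1)]
    return {
        "schooling_effects": schooling,
        "schooling_pooled": sum(schooling) / 2,
        "ca_effects": ca_effect,
        "ca_pooled": sum(ca_effect) / 2,
        # Interaction: does the schooling effect differ across CA levels?
        "interaction": schooling[1] - schooling[0],
    }

# Invented cell means, for illustration only.
example = {(0, 0): 10.0, (0, 1): 14.0, (1, 0): 12.0, (1, 1): 18.0}
result = factorial_2x2(example)
print(result)
```

The two entries in each list correspond to the two independent assessments mentioned in the text; in practice each contrast would be tested against between-S error because the samples are independent groups.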

Descriptive Uses of the Sampling Procedure in Figure 4

The sampling procedure outlined in Figure 4 was developed on the
premise that research designs and educational research methods must
serve both analytic and descriptive purposes. In an analytic sense,
the use of the above sampling procedure for either within- or between-
grade contrasts permits the independent influences of schooling and
other CA-related factors to be estimated. However, the above sampling
procedure has an added utility, that of permitting amount-of-schooling-
performance functions to be generated in much the same manner that CA-
performance functions are generated in research concerned with
developmental phenomena.

That is, the use of the sampling procedure outlined in Figure 4 
permits the cumulative influences of schooling to be estimated across 
grades. Such a schooling-performance function would be represented 
by adding the differences in performance for matched CA samples over 
different times of the school year for Ss in different grades; i.e.,
the estimate of the influences of schooling for the first educational 
period would be represented by [(X^ - X^) + (X^ - X^)] / 2 or by
[(X^ + X^) - (X^ + X^)] / 2. The estimate of the educational experiences for
the second period of schooling would be represented by
[(X^ + X^) - (X^ + X^)] / 2.

The cumulative influences of schooling across educational periods and
grades would be represented by pooling the estimates across these
periods. This sampling procedure also permits CA-performance functions
to be estimated independently of the influences of schooling. This
would be accomplished by pooling performance differences for samples
varying in CA who have equivalent educational experiences; e.g.,





Within- and Between-Grade Contrasts in Educational Research

It is emphasized that not all educational problems and issues
require a sampling plan as elaborate as that specified in Figure 4.
In fact, most research problems probably require that only a section of
the total sampling plan be used. Such a determination must be made
by the individual researcher after taking into consideration the
nature of the research problem, past empirical findings, and the
theoretical model or hypotheses to be investigated. However, it is of
interest to note some of the additional phenomena which may be studied
when within-grade contrasts and/or between-grade contrasts are made in
conjunction with the above sampling plan. For example, within-grade
contrasts would be especially appropriate when the researcher is
interested in cross-seasonal behavior changes in the children: the
amount of time spent in study may vary with the season
of the year or the proximity to important holidays (e.g., Christmas).
Similarly, between-grade contrasts for matched-CA samples at the end
of one grade and the beginning of another may provide information
concerning the (non-CA-related) impact of changing school grades on
children's behavior.

CA, ALTERNATE DEVELOPMENTAL SCALES
AND RESEARCH METHODOLOGY

There has recently been considerable controversy and discussion
concerning the role and use of CA in studies concerned with describing
the nature and course of behavioral development (Baltes, 1968; Baltes
& Goulet, 1971; Bijou, 1968; Birren, 1959, 1965; Goulet, 1970, 1973;
Kessen, 1960; Neugarten, 1968, 1973; Neugarten & Datan, 1973; Schaie,
1965; Wohlwill, 1970, 1973). However, most of these papers have been
concerned with the limitations of CA rather than considering the role(s)
that it does play in developmental inquiry. Furthermore, the general
concerns regarding the limitations of CA as a variable in developmental
research are shared, but the reasons for this concern vary widely. The
present section represents an attempt to classify the various uses and
limitations of CA from different theoretical perspectives, especially
as they relate to attempts to identify developmental (as opposed to
generation-related or secular change-related) changes in behavior.

Age Scales and Development 

Kessen's (1960) statement defining the subject matter of developmental
psychology provides an excellent base from which to describe the various
uses of CA in developmental research. He proposed: "A characteristic
is said to be developmental if it can be related to age in an orderly or
lawful way" (p. 36). Apart from occasional and periodic reminders that
age does not qualify as an experimental variable (e.g., Baltes, 1968),
the functional statement R (response) = f (Age) has been generally
accepted [even with its limitations (Birren, 1959; Wohlwill, 1973)] by
most developmentalists as defining the subject matter of the field.




While not rejecting the importance of CA as an index of behavioral
change, Neugarten and Datan (1973) suggest, "It is a truism that
chronological age is at best only a rough indicator of an
individual's position on any one of numerous physical or psychological
dimensions. The significance of a given chronological age ... when viewed
from a sociological or anthropological perspective, is a direct
function of the social definition of age." Similarly, Baer (1970)
suggests that CA is used rather grossly as a cataloging device in order
to manage the apparently unmanageable diversity and heterogeneity
which exists among children. His comments highlight a number of
important elements regarding the use of CA by developmental
researchers. We suggest that the conventional methods of subject selection
and matching in developmental research rarely consider the "point
of origin" as a nominal property. Rather, the major concern is to
describe and explain the behavior changes or differences which occur
across time for selected populations. For example, researchers using
Ss enrolled in school typically select and differentiate samples by
school grade rather than chronological age. The CA range of the
children within a specific school grade, however, typically meets
or exceeds 12 months. Thus, even though the average difference in
CA for Ss selected from successive grades will approximate 12 months
(as the metric of time), the use of birth as a functional defining
characteristic has been sacrificed.

Similar conventions exist in the literature concerning adult
development and aging, where the performances of Ss falling within
specific CA ranges, e.g., 26-35, 36-45, 46-55, etc., are compared.
Again, such a convention maintains equal time (or age) intervals
between successive groups but sacrifices the point of origin as one
of the formal characteristics of a CA-based scale of development.
In other words, the concern of the researcher has been to describe
the developmental changes which occur across the time or age range
included in the study, using the developmentally "youngest" sample
for comparison. One possible reason for this is that developmental
and educational research does not, as yet, require a high degree of
precision in matching variables (e.g., Baer, 1970). However, a
central premise of this paper is that matching criteria are important
since different uses of the point of origin serve as convenient
cataloging devices to differentiate among various "types" of developmental
research.

Three Uses of CA in Developmental Research

Wohlwill (1973), Baer (1970), and others suggest that CA, as an
index along which to measure behavior change, can be used as a purely
descriptive (and thus causally neutral) scale. We suggest that such
a position is appropriate only if the point of origin (e.g., birth)
is disregarded as a functional characteristic in developmental inquiry.
In other words, if time since birth is functionally irrelevant, then
the only operative characteristic is the metric of time (in this
case calendar time). However, a developmental scale must involve
both nominal characteristics, i.e., point of origin and metric of
time. Chronological age is no exception. When CA is used as an index
of development, the investigator accepts birth by fiat as a
significant life event against which to describe the course of behavioral
development. Furthermore, birth, as a point of origin, specifies
the manner in which Ss are to be matched or differentiated as to
level of development.

A second use of chronological age by developmental researchers
has been aptly discussed by Birren (1959) and Wohlwill (1973).
Birren (1959) suggests that the aging process takes three forms:
biological, psychological, and social aging. Biological aging
designates the position of the individual along his/her natural life span
in ordinal units. Psychological aging refers to the achievements
and potentials of the individual. Social aging refers to an
individual's acquired social habits and status, a composite of the
individual's performance in social roles. Birren acknowledges the
substantial degree of overlap between these three "types" of aging
but suggests that these are the most likely candidates for alternate
age scales. Since these scales currently do not exist, CA is used
as a convenient substitute for underlying biological, psychological,
or sociological processes and is assumed to correlate with each of
them. Given that CA is used as a measure reflecting some underlying
process, several assumptions have to be made: first, the "point of
origin" of the process must be correlated with birth; and second,
a linear relation exists between the underlying process and CA, at
least over the ages or period of interest.

The third form of a CA scale may be designated as a state or
stage scale. Such a scale may take different forms, but the defining
characteristic is that a particular period within the life span of
an individual is charted by points (designated by CA) of transition
from one developmental status to another. State-oriented scales
are similar to the process-oriented scales discussed above in that the
theoretical basis of such a scale may have biological, sociological,
or psychological underpinnings. The major difference between the
two types of scales is that state- or stage-oriented developmental
scales assume at least some degree of discontinuity of processes
between adjacent developmental periods.

Neugarten and Datan (1973) point out that, "Although
anthropologists ... have pointed to discontinuities in cultural conditioning at
various points in the life cycle, the recognition of the need for
resocialization in adulthood is relatively new." They suggest that
"new learning" across the life span occurs in response to, or
anticipation of, the succession of life tasks (or social roles) which
individuals adopt. For example, familiar "transition" points on a
sociological scale are entry into school, marriage, retirement, etc.
The criterion for selecting important transition points is that the
social role in question be accompanied by a relatively circumscribed
set of behavioral expectations. In this regard, there is strong
agreement among members of a society concerning the salutatory
significance of life events (Neugarten & Datan, 1973).

Discontinuous state scales have been developed from both
psychological and biological perspectives. For example, the major periods
in Piaget's theory (e.g., sensory-motor, preoperational, concrete
operations, and formal operations) constitute fundamentally
discontinuous stages in the individual's life span, and each describes a specific
set of behaviors. Similarly, puberty constitutes a biologically
related transition period.


The use of CA to mark transitions between stages requires that
CA and the succession of social, psychological, or physical states
be highly correlated. Neugarten and Datan (1973) have provided such
evidence from a sociological perspective by noting a high degree of
consensus regarding the timing (in terms of CA) of major life events
in an individual's life span. Similarly, there is general agreement
among diverse sets of respondents regarding the chronological age
boundaries differentiating life periods (e.g., English and English,
1957; Neugarten, Moore, and Love, 1956).

Reconsideration of the Longitudinal Method and Behavioral Development 

The study of developmental changes in behavior spawns a single,
basic research paradigm: the longitudinal method. The defining
property of the method is that a single individual is tested at two
or more points in time. It is also important to note that the method
is theoretically neutral since its use does not require the
investigator to adopt a specific developmental scale along which to chart
the sequence of human development. If longitudinal measurements are
collected for several individuals, the resultant data permit conclusions
to be drawn regarding the interindividual similarities in the sequence
of behavioral development. When marked similarities in the sequence
of occurrence of behaviors are observed among the individuals studied,
the regularities cannot be charted on a developmental scale since the
longitudinal method makes no reference either to the point of origin
or the metric of change. The developmental scale adopted for this
purpose should be the one which is most highly correlated with the
behavior studied. Once adopted, the scale specifies the manner in
which the data of individual Ss are to be grouped and the nature of
the time intervals across which the behaviors are to be described.

Therefore, alternative developmental research methods are
derivable only after the investigator adopts a theoretically
meaningful scale.* For example, cross-sectional measurement
is often used as a convenient substitute
for longitudinal measurement. The selection of the different groups
of Ss for testing requires that the researcher choose a specific
developmental scale. Once the scale is chosen, the criteria for
subject selection and matching become apparent. Additionally, it
is now possible to specify the alternate longitudinal and cross-
sectional designs specified by the scale.

* In this paper, the subsequent use of "developmental scale" is to be
taken in the above described generic sense and not in reference to
any specific metric.

In short, the longitudinal method is a theoretically neutral
and generalized research method in developmental inquiry.
Furthermore, when used in its generalized form, it provides data concerning
the sequence but not the temporal course of behavioral development.
Special cases of the longitudinal method (along with their cross-
sectional counterparts) are derivable only when the researcher adopts
a developmental scale. For example, if CA is selected as the scalar
metric, Ss are matched or differentiated according to CA and can
therefore be selected and tested according to either longitudinal
or cross-sectional sampling procedures.

Each developmental scale spawns its own unique longitudinal
method. A process-oriented developmental scale, for example, may
involve selecting and matching Ss according to a biological,
sociological, or psychological process (e.g., skeletal age; Shuttleworth,
1937) and testing the Ss at selected points in time (defined by either
calendar units or process-related criteria) thereafter. Similarly,
stage- or state-scales of behavioral development would specify
matching criteria defined by the stages or states in question.
Neugarten and Datan (1973), for example, have described an alternate
longitudinal paradigm in which the point of origin differs from a
CA-based scale but which retains the same metric of time. In this
regard, the functional point of origin of a particular behavioral
sequence may be the acceptance of a particular social role (e.g.,
fatherhood), and the patterns of behavior change following this event
can be charted on a scale of calendar time, e.g., fatherhood,
fatherhood + one unit, fatherhood + two units, etc.
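
The idea of keeping the metric of time while changing the point of origin can be sketched as a simple re-indexing of one S's longitudinal record. The Python below is an illustration only: the function name, the record, and the two origin years are invented, and calendar years are assumed as the unit of time.

```python
# Illustrative sketch; rescale and all data values are hypothetical.

def rescale(observations, origin):
    """Re-express (calendar_time, score) pairs as (units since origin, score).

    Observations made before the point of origin are dropped, since they
    fall outside the sequence that the origin is taken to start.
    """
    return [(t - origin, score) for t, score in observations if t >= origin]

# Invented record for one S: calendar year and an arbitrary score.
record = [(1970, 3), (1971, 4), (1972, 6), (1973, 7)]
birth_year, fatherhood_year = 1945, 1971    # invented points of origin

by_age = rescale(record, birth_year)        # CA-based scale
by_role = rescale(record, fatherhood_year)  # fatherhood, fatherhood + 1, ...
print(by_age)
print(by_role)
```

The spacing between successive observations is the same on both scales; what changes is which Ss would be treated as matched, since matching follows the chosen point of origin.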

The striking parallels between CA-based and process-oriented
scales are readily apparent. In both cases, behavior change is charted
in terms of proximity (measured in units of calendar time) to an
important life event. In addition, birth (on a descriptive CA-based
scale) and fatherhood (on a process-oriented sociological scale)
provide the only "benchmark" or point of origin. This suggests an
underlying continuity of behavior change across time marked from the
point of origin of the behavior being studied. The scales differ,
however, since Ss are matched (and differentiated) according to
criteria defined by the different "functional" points of origin for
the two scales.

Parallels to the longitudinal paradigm proposed by Neugarten
and Datan (1973) also exist in theories focused on biological and
psychological processes. As an example, the classic study by
Shuttleworth (1937) provided data concerning the correlation between
puberty and the "growth spurt" in adolescence. This was accomplished
by matching Ss for the onset of puberty (rather than CA) and charting
physical growth from this point forward. Within a psychological
framework, Piaget (e.g., 1928) also accepts this method by suggesting
that the sequence of behavior change follows a universal order
starting with the onset of psychological periods and stages.
Interestingly, Bijou and Baer (1961, 1965) follow a line of
reasoning very similar to that of Neugarten and Datan (1973) by suggesting that
environmental "setting events" influence behavior throughout life.

The preceding discussion has highlighted several important
points related to subject selection and matching in developmental
research. First and foremost, the adoption and use of a specific
developmental scale requires the researcher to adopt certain assump-
tions relating to point of origin and the metric of time. However,
as has been suggested, the nominal properties of the point of origin
are rarely considered in developmental research. Rather, the concern
in most research is with the study of a developmental process and how
it changes with time. Subjects are chosen and tested on the basis
of representing the ages or time periods over which the process is
thought to change. In such cases, the functional point of origin
for the developmental study in question is the developmentally
"youngest" sample. The nominal and functional points of origin
for the researcher may then be different, e.g., birth vs. six-
year-olds; yet the nominal and functional metric of time may be
identical (e.g., units of calendar time such as months, years, etc.).

It is important at this point to discuss additional limitations 
of the sampling model proposed by Schaie (1965). First, Schaie 
limited his model to situations where the researcher has adopted a 
CA-based scale of behavioral development. This is an unnecessary 
restriction of the model. In addition, two additional limitations 
of the model are at issue here. 

The first limitation, discussed earlier, has received considerable
attention from others (e.g., Baltes, 1968; Baltes & Nesselroade, 1974;
Buss, 1973; Schaie, 1965; Wohlwill, 1973) and concerns the functional
independence of the components of age, cohort, and time. For ex-
ample, Baltes' (1968) suggestion that the three components are not
mutually independent, i.e., once two components have been defined, the
third is fixed, is relevant here. As Buss (1973) and Wohlwill (1973)
have argued, such criticisms relate to methodological rather than
theoretical concerns. Even though any two of the components cannot
be functionally varied independently of the third, the concepts of
developmental (age), generational (cohort), and secular (time-related)
change do indeed qualify as separate theoretical concepts (e.g., Buss,
in press; Troll, 1973).
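The dependence among the three components can be made concrete with a small sketch. It assumes a CA-based scale in which cohort is indexed by year of birth and time of measurement by calendar year, so that age is simply the difference between the two. The function names and the particular years are illustrative only and do not come from Schaie (1965).

```python
# Illustrative sketch: on a CA-based scale, age = time of measurement - cohort.
# All names and years below are hypothetical choices made for the example.

def age_at_testing(cohort_year, test_year):
    """Chronological age (in years) of a birth cohort at a testing year."""
    return test_year - cohort_year


def design_cells(cohort_years, test_years):
    """Enumerate every cohort x time cell together with the implied age."""
    cells = []
    for cohort in cohort_years:
        for time in test_years:
            age = age_at_testing(cohort, time)
            if age >= 0:  # a cohort cannot be tested before it is born
                cells.append({"cohort": cohort, "time": time, "age": age})
    return cells


def third_component_is_fixed(cells):
    """Check that any two components determine the third in every cell."""
    names = {"age", "cohort", "time"}
    for first, second in (("age", "cohort"), ("age", "time"), ("cohort", "time")):
        third = (names - {first, second}).pop()
        seen = {}
        for cell in cells:
            key = (cell[first], cell[second])
            if seen.setdefault(key, cell[third]) != cell[third]:
                return False
    return True


cells = design_cells([1960, 1965, 1970], [1970, 1975, 1980])
for cell in cells:
    print(cell)
print("third component fixed:", third_component_is_fixed(cells))
```

Because no cell can be constructed in which two components are held constant while the third varies, the three main effects cannot all be estimated from a single fully crossed design; this is the methodological point, and it leaves the theoretical distinction among the three kinds of change untouched.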



It is important to highlight two additional aspects of the issue
concerning the independence of the three components. The first
aspect concerns the manner in which the three components are defined
and the way in which populations are matched. First, Schaie's model,
by adopting CA as a developmental scale, not only restricts the
researcher to indexing behavioral development from birth as a point
of origin, but also confines the definition of cohort to date of
birth rather than some alternate definition, such as the popula-
tion of children who entered first grade in September, 1975, etc.

Any deviation from a CA-based scale requires modification of the
general model proposed by Schaie (1965). As an example, if the subjects
to be tested were defined in terms of a sociological state (as a level of
development) and time of testing, e.g., all subjects who were married
for the first time in September 1975, the third component, cohort,
would lose all functional meaning when defined in terms of birthdate.
Similarly, if cohort is defined in terms of "family lineage" or one
of the alternate accepted definitions of generations and generational
change (e.g., Troll, 1973), time of measurement may be specified, but
CA loses theoretical and functional meaning. The point is, if a
developmental scale other than a CA-based one is selected for use,
all three components must be re-examined both methodologically and
theoretically.

The second limitation of Schaie's developmental model concerns
the restrictive manner in which the second formal characteristic
of time-related scales (the metric of change) is defined. That is,
the use of Schaie's model restricts the investigator to a scale of
calendar time rather than one which might more properly fit the
phenomenon under study. While it would be possible, for example, to
identify samples of subjects on a scale of biological development
(e.g., skeletal age) and to test the samples at selected testing points
(e.g., September, 1973, and September, 1976), the second testing point
would have to occur after an equal time interval for all subjects
or else the functional meaning of time of measurement (as defined
by Schaie) would be lost. In addition, even though the above
research design (skeletal age x time) conforms in some respects to
Schaie's (1965) cross-sequential design, the main effects of time
of measurement would more properly reflect developmental change
than secular change for the two populations.

The above discussion is not meant to discount the importance
of the concepts of age, cohort, and time of measurement in the study
of behavioral development. Indeed, the present analysis reaffirms
the need to incorporate variants of Schaie's sequential analyses as
necessary paradigms in developmental research. In fact, the present
analysis suggests two additional types of variants of Schaie's
sequential paradigms, and leads to the conclusion that Schaie's model
itself is restricted in its generalizability. These points are dis-
cussed in Chapter 3.







REFERENCES



Baer, D. M. An age-irrelevant concept of development. Merrill-
Palmer Quarterly, 1970, 16, 238-245.

Baltes, P. B. Longitudinal and cross-sectional sequences in the
study of age and generation effects. Human Development, 1968,
11, 145-171. (a)

Baltes, P. B., & Goulet, L. R. Exploration of developmental variables
by manipulation and simulation of age differences in behavior.
Human Development, 1971, 14, 149-170.

Baltes, P. B., & Nesselroade, J. R. Cultural change and adolescent
personality development: An application of longitudinal
sequences. Developmental Psychology, 1972, 7, 244-256.

Bell, R. Q. Convergence: An accelerated longitudinal approach. Child
Development, 1953, 24, 145-152.

Bijou, S. W. Ages, stages and the naturalization of human development.
American Psychologist, 1968, 23, 419-427.

Bijou, S. W., & Baer, D. M. Child Development, Vol. 1. A systematic
and empirical theory. New York: Appleton, 1961.

Bijou, S. W., & Baer, D. M. Child Development, Vol. 2. New York:
Appleton, 1965.

Birren, J. E. Principles of research on aging. In J. E. Birren (Ed.),
Handbook of aging and the individual. Chicago: University of
Chicago Press, 1959.

Birren, J. E. The psychology of aging. New York: Prentice-Hall,
1964.

Buss, A. R. An extension of developmental models that separate
ontogenetic changes and cohort differences. Psychological
Bulletin, 1973, 80, 466-479.

Buss, A. R. Generational analysis: Description, explanation, and
theory. Journal of Social Issues, in press.

English, H. B. Chronological divisions of the life span. Journal
of Educational Psychology, 1957, 48, 437-439.

Goulet, L. R. Training, transfer, and the development of complex
behavior. Human Development, 1970, 12(4), 213-240.




Goulet, L. R. The interfaces of acquisition: Models and methods
for studying the active, developing organism. In J. R.
Nesselroade and H. W. Reese (Eds.), Life-Span Developmental
Psychology: Methodological Issues. New York: Academic
Press, 1973.

Goulet, L. R., Hay, C. M., & Barclay, C. R. Sequential analyses and
developmental research methods: Descriptions of cyclical
phenomena. Psychological Bulletin, 1974.

Goulet, L. R., Williams, K. G., & Hay, C. M. Longitudinal changes in
intellectual functioning in pre-school children: Schooling and
age-related effects. Journal of Educational Psychology, 1974.

Kessen, W. Research design in the study of developmental problems.
In P. H. Mussen (Ed.), Handbook of research methods in child
development. New York: Wiley, 1960, 36-70.

Neugarten, B. L. Adult personality: Toward a psychology of the life
cycle. In B. L. Neugarten (Ed.), Middle Age and Aging.
Chicago: University of Chicago Press, 1968, 137-147.

Neugarten, B. L. Personality change in late life: A developmental
perspective. In C. Eisdorfer & M. P. Lawton (Eds.), The
psychology of adult development and aging. Washington, D. C.:
American Psychological Association, 1973, 311-338.

Neugarten, B. L., & Datan, N. Sociological perspectives on the life
cycle. In P. B. Baltes and K. W. Schaie (Eds.), Life-span
developmental psychology: Personality and socialization.
New York: Academic Press, 1973, 53-71.

Neugarten, B. L., Moore, J. W., & Lowe, J. C. Age norms, age con-
straints, and adult socialization. American Journal of Socio-
logy, 1965, 70, 710-717.

Schaie, K. W. A general model for the study of developmental
problems. Psychological Bulletin, 1965, 64, 92-107.

Shuttleworth, F. K. Sexual maturation and the physical growth of
girls age six to nineteen. Monographs of the Society for
Research in Child Development, 1937, 2, No. 5.

Troll, L. E. Issues in the study of generations. Aging and Human
Development, 1970, 1, 199-218.

Wohlwill, J. F. The age variable in psychological research.
Psychological Review, 1970, 77, 49-64. (b)

Wohlwill, J. F. The study of behavioral development. New York: Academic
Press, 1973.



CHAPTER 3

GENERAL SAMPLING STRATEGIES FOR B = f(T) RESEARCH
General Sampling Designs for B = f(T) Research 



In Chapter 2, the discussion highlighted the fact that Schaie's 
general developmental model represents only one of a family of 
sampling strategies amenable to the study of behavior changes 
associated with time. Other models, similar in form to the one 
Schaie (1965) proposes, may be derived whenever the researcher 
adopts a developmental scale other than CA.

The first variant of Schaie's (1965) sequential analyses paral-
lels his general developmental model with the exception that a
developmental scale other than CA is used. Figure 1 provides an
example of the model using a developmental index based on sociologi-
cal criteria. Samples of Ss (cohorts) who were married for the
first time in 1970, 1975, and 1980 are tested at the time of marriage
and in increments of five years thereafter.

The use of Schaie's developmental model requires that the age
and cohort variables share the same nominal and/or functional point
of origin. The choice of a sociological scale of development (time
since marriage) leads to a redefinition of the cohort variable (year
of marriage) in the same manner that CA as a developmental index
presupposes a definition of cohort based on date of birth. Never-
theless, a sampling design such as that provided in Figure 1 permits
cohort-sequential, time-sequential, and cross-sequential analyses to
be performed if a minimum of six samples of Ss conforming to the sam-
pling design in Figure 1 are represented.
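As a rough illustration of how the six samples arise, the sketch below enumerates the cells of a design like Figure 1 under the assumption that cohorts are defined by year of first marriage (1970, 1975, 1980) and that testing occurs in those same three calendar years. The function and variable names are hypothetical and are used only for this example.

```python
# Illustrative sketch of the Figure 1 design: cohort = year of marriage,
# developmental level = years since marriage, time = calendar year of testing.
# Names are hypothetical; the years are those given in the text.

MARRIAGE_COHORTS = [1970, 1975, 1980]
TEST_YEARS = [1970, 1975, 1980]


def figure1_samples(cohorts, test_years):
    """Return (test_year, years_since_marriage, cohort) for each testable cell."""
    samples = []
    for test_year in test_years:
        for cohort in cohorts:
            years_since = test_year - cohort
            if years_since >= 0:  # a cohort is tested only from marriage onward
                samples.append((test_year, years_since, cohort))
    return samples


def levels_by_cohort(samples):
    """Group developmental levels by cohort (the cohort-sequential view)."""
    grouped = {}
    for _, years_since, cohort in samples:
        grouped.setdefault(cohort, []).append(years_since)
    return {cohort: sorted(levels) for cohort, levels in grouped.items()}


samples = figure1_samples(MARRIAGE_COHORTS, TEST_YEARS)
for test_year, years_since, cohort in samples:
    print(f"tested {test_year}: M + {years_since} years, married {cohort}")
print(levels_by_cohort(samples))
```

Grouping the same six samples by testing year or by developmental level, rather than by cohort, gives the time-sequential and cross-sequential views of the design.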

Figure 1 provides an example of an alternate model based on
sociological criteria; parallel models may be derived using
psychological or biological criteria.

The paradigms basically conform to Schaie's model, and share
some of the same attributes and limitations. The attributes have been
fully documented by Schaie (1965), Baltes (1968) and in the present
paper. The major limitation of Schaie's (1965) model is that the
three components of developmental change (age, cohort, and time-of-
testing) cannot be defined independently of one another and this
limitation is shared by the variant of the general model presented
in Figure 1. As was mentioned in Chapter 2, such difficulties arise
when the scales used to define the age and cohort variable share the
same nominal and/or functional point of origin.

                              Age Level
Time of
Measurement         M         M + 5 years    M + 10 years

   1970            1970
   1975            1975          1970
   1980            1980          1975           1970

*Cell entries refer to cohort groups defined by date of marriage (M).

Figure 1

A Sampling Model for Developmental Research
Based on Sociological Criteria*

However, it is possible to generate sequential paradigms analo-
gous to time-, cohort-, or cross-sequential sampling strategies which
do not share this limitation. Figure 2 provides one example of a
variant of a cohort-sequential design. Cohort is defined by family
lineage and developmental level by the sociological state of
marriage.



The second variant of Schaie's sequential analyses is derivable
if the assumption is made that age (maturation), cohort (generation)
and time (secular change) are defined independently of one another.

The research paradigms parallel the sequential designs proposed
by Schaie in that generational, secular, and age changes are the
focus of the investigation. The paradigms also adopt calendar time
as the metric. However, since the components of age, cohort, and
time-of-measurement are by definition uncorrelated, the paradigms
differ from those proposed by Schaie (1965).

CA and Other Age Scales of Development 

The previous discussion has highlighted the similarities between
CA- and alternate developmental scales. It was shown that each
scale generates its own prototype of longitudinal and cross-sectional
sampling strategies and its own variant of the sequential strategies
proposed by Schaie (1965).

The final type of design to be proposed here examines the rela-
tionships between CA-, sociological-, biological-, and/or psychological-
scale(s) of development.

Such investigations could take the form specified in Figure 3a,
where Ss representing different levels of CA are tested at the point
of marriage and five years thereafter. The differences between the
row means represent effects attributable to CA, whereas differences
between the column means reflect effects which covary with time since
marriage. Both "independent" variables are developmental in nature
and the results from such an investigation permit inferences to be
made regarding the degree to which performance varies with CA, time
since marriage, or both. As such, the design provides informa-
tion regarding the sensitivity of two alternate age-scales to the
phenomenon of interest. Nevertheless, the design, even though
calendar time of measurement is controlled as with any cross-sectional
sampling procedure, does not permit the cohort influences to be
separated from those related to development.

Figure 3b represents another variant of such a design. It conforms
in some respects to Schaie's time-sequential design in that CA and
time of testing are factorially varied. However, in this case
both CA and time since marriage correlate perfectly with calendar time
(1970, 1975), i.e., Ss from both cohorts were married in 1970.
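The difference between the two variants can be checked mechanically. The sketch below encodes the cell entries (times of measurement) as they appear for Figures 3a and 3b in this copy, keyed by CA at testing and years since marriage; the encoding and the helper names are assumptions made for illustration only.

```python
# Illustrative sketch: cells map (CA at testing, years since marriage) to the
# calendar year of measurement, as read from Figures 3a and 3b in this copy.
# Helper names are hypothetical.

FIG_3A = {(25, 0): 1970, (25, 5): 1975, (30, 0): 1975, (30, 5): 1980}
FIG_3B = {(25, 0): 1970, (25, 5): 1975, (30, 0): 1970, (30, 5): 1975}


def calendar_time_fixed_by_marriage_scale(design):
    """True if years since marriage alone determines the year of measurement."""
    seen = {}
    for (_, years_since), year in design.items():
        if seen.setdefault(years_since, year) != year:
            return False
    return True


def marriage_years_by_ca(design):
    """Implied year(s) of marriage for each CA row (test year - years since)."""
    implied = {}
    for (ca, years_since), year in design.items():
        implied.setdefault(ca, set()).add(year - years_since)
    return {ca: sorted(years) for ca, years in implied.items()}


for label, design in (("3a", FIG_3A), ("3b", FIG_3B)):
    print(label,
          "time since marriage fixes calendar time:",
          calendar_time_fixed_by_marriage_scale(design),
          "| implied marriage years:", marriage_years_by_ca(design))
```

Under this reading, the rows of Figure 3a correspond to different marriage years, whereas in Figure 3b every S was married in 1970, so that time since marriage cannot be distinguished from calendar time of measurement.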



                                  Cohort
Developmental Level        Father        Son

Marriage
Marriage + 5 years

Figure 2

A Cohort-Sequential Design Based on Independently-
Defined Cohort and Age Levels



Figure 3a

CA at time
of testing          Marriage      M + 5 years

    25                1970           1975
    30                1975           1980

Figure 3b

CA at time
of testing          Marriage      M + 5 years

    25                1970           1975
    30                1970           1975

*Cell entries correspond to times of measurement.

Figure 3

Sampling Designs for Developmental Research Varying
Developmental Level Along Two Dimensions



The merit of designs such as those described in Figures 3a and
3b from the framework of an educational perspective is best illus-
trated by a reconsideration of the sampling model for educational
research presented in Chapter 2 (Figure 4) and presented in another
form in Figure 4.

There is a paucity of data available utilizing the paradigm
exemplified in Figure 4. However, scrutiny of the literature reveals
a set of studies (Baltes & Reinert, 1969; Schaie, 1972) conducted
for other purposes but which nevertheless provide for comparisons
in which CA and amount of exposure to school curricula are ortho-
gonally varied. Furthermore, there are several sets of data emanat-
ing from our laboratory which were collected for the primary purpose
of testing the utility of the sampling procedures presented in Figure 4.
These data provide for within-grade contrasts (Goulet, Williams, &
Hay, 1973, in press; Goulet, Williams, Bozinou & Hexner, 1973; Wood &
Goulet, 1973a), and between-grade contrasts (Wood & Goulet, 1973b).

In view of the recent availability of such data, it is considered
important to present the results in summary form and to discuss the
studies themselves in considerable detail. The studies provide infor-
mation regarding the independent behavioral correlates of schooling
and CA for children across the range of CA from four to nine years and
from nursery school to fourth grade. Also, data are available across
a variety of behavioral domains including intellectual growth (Baltes
& Reinert, 1969; Goulet, Williams & Hay, 1974; Schaie, 1972), visual-
perceptual performance (Wood & Goulet, 1973a, 1973b), single-trial
free recall performance, subjective estimates of recall ability
(Goulet, Williams & Hay, 1973), and the utilization of rules of
addition (Goulet, Williams, Bozinou & Hexner, 1973).



Summaries of each of the sets of data providing within-grade con-
trasts are presented in Table 1 and are identified by author and the
available measure of performance. Table 2 provides the data from the
single study (Wood & Goulet, 1973b) where between-grade contrasts are
possible. In each instance except where noted, CA and time of testing
in the school year are varied and
superior performance is reflected by higher scores. The row and
column means for each of the matrices in Table 1 represent performance
for the main effects of Time of Testing and CA, respectively. In each
case, the data represent means based on independent samples and the
data are amenable to analysis within a 2 x 2 factorial design with CA
and Time of Testing as the two factors. In addition, with the exception
of parts of the Baltes and Reinert (1969) data or where noted, the
main effects for CA and for Time of Testing are statistically signifi-
cant. No interactions were evident in the data.

In each case the data represent the performance of children who
were enrolled in the appropriate grade for their age. To eliminate
the possibility of a selection bias related to grade placement, the
children were selected for testing from the middle 70 percent
of the age range within a class; i.e., the youngest and oldest
children within a grade were not sampled.

[Figure 4. An Extended Sampling Strategy for Testing School-Age
Children. Columns: month of testing (Sept., Jan., May) under each of
two headings, 1 and 2; rows: chronological age at testing. Cell
entries are not legible in this copy.]

[Table 1: within-grade contrasts, identified by author and measure
of performance. Cell entries are not legible in this copy.]

Table 2

Summary Means for Research Permitting
Between-Grade (Matched-CA) Contrasts

Wood & Goulet (1973b): Errors

                       Grade
               (first)    (second)     Mean

Oct.            13.8         9.5       11.7
April            7.6         6.5        7.0

Mean            10.7         8.0

Note: The grade column headings are only partly legible in this copy
("1" and "(5-11)").

Table 1 provides data taken from Baltes & Reinert (1969). The
data represent raw score performance on each of four subtests of
intelligence (including letter series, word completion, basic arithme-
tic, and letter counting) which were collected in the months of March
and July for samples ranging in CA from 8-4 to 8-8 years (third grade)
in Study I, and 9-4 to 9-8 years (fourth grade) in Study II.
Therefore, only the directionality of results is discussed. As is
apparent, the diagonal contrast (upper-left and lower-right cell means)
provides data representing longitudinal changes in performance, the
vertical (cross-row) contrast represents a time-lag comparison, and the
horizontal (cross-column) contrast represents a cross-sectional compari-
son. Only the longitudinal comparison involves mean differences which
confound CA and length of schooling. As may be seen from these data,
the longitudinal contrasts provide an estimate of change which exceeds
that of the cross-sectional and time-lag contrasts. Also, with the
exception of the letter-counting measure, the column and row means
suggest that amount of school experience and CA are each positively
correlated with performance. With the letter-counting measure, the
relation between CA and performance is positive and the relation
between amount of school experience and performance is negative. Such
opposing effects of the two variables leave a longitudinal function
which suggests no (or even slightly negative) changes in performance
over the four-month interval which separated the two testing periods.
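The three contrasts can be sketched for a single 2 x 2 matrix of cell means laid out as in Table 1 (rows: time of testing; columns: CA). The numbers below are hypothetical and are not taken from Baltes and Reinert (1969); they are chosen only to show how opposing CA and schooling effects can cancel in the longitudinal contrast.

```python
# Illustrative sketch with hypothetical cell means (not data from the studies).
# Layout: cells[row][column], rows = time of testing (early, late),
# columns = CA (younger, older).

def contrasts(cells):
    """Return the longitudinal, time-lag, and cross-sectional contrasts."""
    (early_young, early_old), (late_young, late_old) = cells
    return {
        # diagonal: younger Ss tested early vs. older Ss tested late
        "longitudinal": late_old - early_young,
        # vertical (cross-row): same CA, different amount of schooling
        "time_lag": ((late_young - early_young) + (late_old - early_old)) / 2,
        # horizontal (cross-column): different CA, same time of testing
        "cross_sectional": ((early_old - early_young) + (late_old - late_young)) / 2,
    }


# Both effects positive: the longitudinal contrast is the largest of the three.
same_direction = [[10, 12], [13, 15]]
# CA effect positive, schooling effect negative: the two cancel on the diagonal.
opposing = [[10, 12], [8, 10]]

print(contrasts(same_direction))
print(contrasts(opposing))
```

In the additive case the longitudinal contrast equals the sum of the other two, which is why it confounds CA with length of schooling.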

The second sets of data in Table 1 are taken from studies by
Schaie (1972) and Goulet, Williams and Hay (1974). The cell means
represent the Mental Age of first-grade (Schaie, 1972) and nursery-
school children (Goulet, Williams, & Hay, 1974). Intellectual per-
formance was found to relate positively to amount of schooling and to
CA for both samples of measures, which were taken in 1933 (Schaie,
1972) and 1973 (Goulet, Williams, & Hay, 1974), and for both boys
and girls (Schaie, 1972).

The third set of data were taken from Goulet, Williams, Bozinou,
and Hexner (1973). The cell means represent performance on a paired-
associates transfer task. In the Rule condition, rapid acquisition
was expected if the children (first-grade) used an addition rule of
"add 1" to learn the individual paired associates in the list. Nonuse
of the rule would interfere with performance. Thus, superior per-
formance is reflected by fewer errors to criterion. In the Interfer-
ence condition, the children learned a transfer list of paired associ-
ates where no rule was possible and interference (negative transfer)
was expected. As the data suggest, superior performance was positively
related to amount of schooling in the Rule condition, whereas the
reverse was true in the Interference condition. Chronological age
was unrelated to performance in the Rule condition, and the older
children learned the transfer task faster (fewer errors) in the
Interference condition.

The data provided by Goulet, Williams, and Hay (1973) take two
forms. The first set of data refer to children's estimates of their
ability for immediate recall. The children were shown up to 10
familiar, but unrelated, pictures and they were asked to judge how
many they could remember if they were shown once. The second set
of data refers to the children's actual recall span; i.e., the long-
est series of pictures they could remember without error after one
presentation. As may be seen from these data, subjective estimates
of recall ability relate positively to CA and negatively to amount
of schooling. For the data on recall span, null effects of CA and
negative effects related to amount of schooling are found.

The data taken from Wood and Goulet (1973a) represent raw score
performance on the Bender-Gestalt Visual Motor Test. The data
represent error scores, so superior performance is represented by lower
scores. Again, amount of schooling is positively related to better
performance, with null effects related to CA.

The last set of data (presented in Table 2) deviate substantially
from those contained in Table 1. First, the data provide for between-
grade contrasts of matched-CA children. Second, the data provide for
longitudinal measurement for these samples across the period from
October to April. Thus, the main effect related to school grade repre-
sents performance differences for samples who differ by one year in
amount of schooling. The main effect for time of measurement, as with
all longitudinal contrasts, confounds CA and time of testing and thus
the results cannot be unequivocally attributed to factors related to
CA or schooling. Nevertheless, the between-grade effect suggests
pronounced facilitative influences of schooling even though the Ss
are matched on CA.
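The marginal means reported in Table 2 can be reproduced from the four cell means. The sketch below uses the cell values as printed (mean errors); because the grade column headings are only partly legible in this copy, the two columns are simply labelled "first" and "second", which is an assumption made for the example.

```python
# Illustrative sketch: marginal means for the 2 x 2 layout of Table 2.
# Cell values are the mean error scores printed in Table 2; the column
# labels "first" and "second" are placeholders, not the original headings.

CELL_MEANS = {
    ("Oct.", "first"): 13.8,
    ("Oct.", "second"): 9.5,
    ("April", "first"): 7.6,
    ("April", "second"): 6.5,
}


def marginal_means(cells):
    """Return (row_means, column_means) as dictionaries of unweighted means."""
    rows, cols = {}, {}
    for (row, col), value in cells.items():
        rows.setdefault(row, []).append(value)
        cols.setdefault(col, []).append(value)
    row_means = {row: sum(vals) / len(vals) for row, vals in rows.items()}
    col_means = {col: sum(vals) / len(vals) for col, vals in cols.items()}
    return row_means, col_means


row_means, col_means = marginal_means(CELL_MEANS)
grade_difference = col_means["first"] - col_means["second"]
time_difference = row_means["Oct."] - row_means["April"]

print("row means:", row_means)
print("column means:", col_means)
print("between-grade difference in errors:", round(grade_difference, 2))
print("October-April difference in errors:", round(time_difference, 2))
```

The column difference corresponds to the between-grade (schooling) effect for matched-CA children; the row difference is the longitudinal change, which confounds CA with time of testing.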

Data such as those presented in Tables 1 and 2 provide support
for the utility of sequential sampling strategies when age
(developmental level) is varied simultaneously with two developmental
scales.

There are a number of issues which warrant further consideration.
The first point of concern is that most small-scale studies and cer-
tainly all available large-scale studies of student development have
relied on simple cross-sectional or longitudinal sampling procedures.
Examples here are the Survey of Equality of Educational Opportunity
(Coleman, 1971), which used a cross-sectional design, and the Growth
Study conducted at the Educational Testing Service (Anderson & Maier,
1963; Hilton & Meyers, 1967), which involved a longitudinal design.
As Hilton and Patrick (1970) have noted, the results of both of these
studies confound the developmental changes of primary interest with
generational or secular change factors, respectively, which occurred
for the samples tested. Just as important for present purposes, the
above studies were initiated for the purpose of explicating the
influences of school experiences across grades and yet provide no
estimates of these effects.

The data provided in Tables 1 and 2 uniformly provide support for
the assumption that influences of schooling exist independently of
those which may be expected from normal aging; i.e., from the cumula-
tive influences of past experience and/or maturation (Baltes & Goulet,
1971; Schaie, 1965), and also suggest the utility of providing inde-
pendent estimates of performance associated with nonschool-related
changes in chronological age. Such estimates become especially
important under conditions where the factors associated with CA and
school experience may have opposing effects (e.g., Baltes & Reinert,
1969; Goulet, Williams, Bozinou, & Hexner, 1973; Goulet, Williams &
Hay, 1973). In this regard, the suggestions offered here parallel
those of Schaie (1965), Baltes (1968), Hilton and Patrick (1970)
and others who have been primarily concerned with separating sources
of variance associated with generational, secular, and age change in
student development.

Nevertheless, it is not the intent here to elevate either chrono-
logical age or amount of school experience to the status of an experi-
mental/independent variable. Chronological age remains a descriptive,
biotic variable (as indeed does school experience in the context in
which it is used here) since it cannot be experimentally manipulated
or replicated. That is not to say that CA is a useless variable.
It remains one of the most useful ways in which to classify or cate-
gorize children (Baltes & Goulet, 1971; Kessen, 1961; Wohlwill, 1970)
and by which to chart behavioral change in research of a developmental
nature. In the context of the present paper, CA-related changes in
behavior are divided into two components: those which vary with school-
ing, and those associated with nonschool-related changes in
CA.

A second point is that none of the problems in educational research
are vitiated by the use of school grade, rather than chronological age,
in such studies. Such distinction is obviously important in educational
research but only to the extent that it is made meaningful through the
assessment of the behavioral changes which occur over the school year
for the grade samples tested and to the extent that other CA-related
factors are controlled.

It is also important to mention that the sampling strategy sug-
gested in Figure 4 is similar to certain popular designs used in
educational research. One example is the time by treatment design
where two or more randomly selected groups of children matched in CA,
school grade, etc., are exposed to different school curricula over
some instructional period and the performance of the groups is
contrasted at the end of the instructional period. Such a design,
which involves elements of both longitudinal and experimental methods,
controls for CA between the two groups of children. Unfortunately,
the design suffers from the fact that the children are both older and
have undergone the instructional sequence at the end of training.
Thus, the performance differences among the experimental groups re-
flect not only the independent influences of the instructional sequence
but also the interaction between CA and the instructional treatments
in influencing performances (Goulet, 1970). This inference holds even
though Campbell and Stanley (1963) refer to such a design as a "true
experimental design." It is not until CA is incorporated into the
design that the interaction of CA and instructional treatments and the
independent influences of the instructional treatment upon performance
may be separated. As is apparent, this modification of the design
has each of the elements of the sampling plan exemplified in Figure 2,
of course, with the desirable addition of an experimental treatment.

The primary issue considered in this paper concerns the assessment
of the effects of educational intervention (used in the broad sense)
on performance over the period of a school year or shorter interval.
However, as has already been mentioned, the influences of schooling
are usually not discernible from other CA-related influences on
performance. That is not to say that the impact of or effects of
exposure to the school curriculum can be considered to be independent
of behavioral development. Rather, school learning must be considered
to be one of the components in the developmental process. It is for
the latter reason that alternate experimental designs have been
developed in developmental psychology to provide estimates of the
effects of educational experiences on performance unbiased by
behavioral development. One such design involves the simulation or
"acceleration" of the process through the provision of massed training
or practice (Baltes & Goulet, 1971; Goulet, 1968). Such an experimental
strategy is used very often in contemporary studies concerned with
cognitive development (e.g., Sigel & Hooper, 1968; Gelman, 1969).
However, such approaches, although appropriate for the study of
developmental phenomena, cannot be generalized directly to school
situations. This is true because: (1) it is not possible to identify
the range of experiences acquired in, or as a direct result of, the
interaction in school, nor is it possible to simulate them in their
entirety in controlled or laboratory situations; and (2) behavioral
change induced through massed practice over a short term must, of
necessity, be limited in scope. Also, attempts to generalize the
findings to school situations are severely limited because of the
possibility of an interaction between time and the acquisition of the
behavioral phenomena of interest. In other words, the products of
school experiences are acquired over a long period and through a
variety of media, including the teacher, age-mates, and non-school
situations prompted by the school curriculum. There is no reason to
expect that the effects of massed practice on specified tasks are
isomorphic with those which are acquired as a result of schooling over
the school year. Finally, studies using such a design focus
(implicitly or explicitly) on the identification of variables which
influence student learning rather than on the description of
education-related behavior change. While such research is needed, it
does not lead to the types of information provided when using the
sampling plan suggested here.

There is a second way to provide direct estimates of the effects
of school experience which are unbiased by independent time- or
age-related components of behavioral change. In the simplest case,
the procedure would involve the comparison of two groups of children
across time (e.g., the school year) under conditions where both groups
were eligible for acceptance into school but where one of the two
groups was enrolled in school and one was not. However, it is extremely
difficult to find "random" samples of children who are of school age
but who have not been enrolled in school. And, even if such a sample
were available in the general population, it would be impossible to
match them with children who were enrolled. The very conditions which
precipitated the lack of enrollment would bias the sample. Campbell
and Stanley have discussed these issues in detail. As is apparent,
the sampling plan presented in Figure 2 utilizes a research strategy
which capitalizes on the latter method while avoiding the potential
sources of confounding when it is used.

SCHOOL EXPERIENCES, CA, AND THE DIRECTIONALITY OF BEHAVIOR CHANGE

The intent of this paper is not to comment directly on either
the nature of the influences of schooling or the relation between
performance and amount of schooling. Nor is it possible to specify
a priori, within the context of the sampling plan exemplified in
Figure 2, either the magnitude or direction of the influences of
factors related to CA and school experience on performance.
Nevertheless, it is appropriate at this time to reiterate some of the
general inferences which may be drawn from the data presented in
Tables 1 and 2 and other sections of the paper. These inferences are
provided below and appropriate discussion follows each point.

1. Available data suggest the utility of adopting the sampling
plan in Figure 4 for educational research purposes and, although
only a few available studies permit contrasts of the type required,
each provides evidence suggesting independent effects associated with
CA and amount of schooling over periods as short as four months.

2. The relation between CA and performance and that between amount
of schooling and performance may be complementary (either positive or
negative) or opposing over the same period.

The point of interest here is that the relation between CA and
performance is not uniformly positive during the years of formal
education. In fact, there is a substantial amount of evidence
suggesting, for example, that the relation between CA and performance
in problem-solving tasks is curvilinear over the age range from three
to eighteen (e.g., Goulet & Goodwin, 1970; Weir, 1964). While the
series of studies from which such inferences were drawn have involved
cross-sectional sampling procedures, there are probably many instances
of behaviors which correlate positively with CA and negatively with
amount of schooling (or vice versa) over the same time period.

3. A basic premise here is that designs used in educational
research require sampling and testing at least at two points within
the school year for Ss in the same grade. It is only with such a
sampling plan that the behavior changes which occur over this period
can be assessed. Such suggestions have already been made (e.g.,
Campbell & Stanley, 1963) and further reiteration regarding this
point is unnecessary. Nevertheless, within-year, as opposed to
between-year, times of testing should also minimize confounding due to
attrition in educational research (e.g., Hilton & Patrick, 1970).

4. A central assumption is that the non-school-related correlates
of behavioral development (as indexed by variations in CA) must be
controlled before the influences of educational intervention can be
assessed. This assumption is similar to that made by Schaie (1965)
and Baltes (1968) in their attempts to differentiate age change from
generational and secular change in developmental research.

5. Although measures of achievement over periods of schooling
generally show at least modest gains, reviewers of such research have
been quick to mention that the achievement gains observed are as likely
attributable to "maturation" as to the influences of instruction
(Austin, Rogers, & Walbesser, 1972). Furthermore, such reviewers
have lamented the fact that educational research directed to assessing
the influences of schooling has provided no data demonstrating that
the gains were maintained over time, especially in contrast to groups
not exposed to instruction over the same period. The use of the
sampling plan in Figure 4 provides for such estimates.

6. The suggestions contained in the present paper also hold in
the context of the norming and standardization of achievement tests.
That is, most standardized tests have utilized either cross-sectional
or longitudinal sampling procedures in obtaining their normative sample.
The biases which result from such a sampling procedure will vary as
a result of date of testing, type of sampling procedure used, and the
relation between amount of schooling, CA, and performance on the
standardized test. These biases have been demonstrated by Goulet,
Williams, and Hay (1974) and readers are referred to this paper for a
complete discussion of this point.

Some final comments concerning the influences of schooling are
warranted. First, there is no intent to imply that the results
attributed to the influences of school experience in the present study
are directly or exclusively attributable to the "in-classroom"
experiences of the children. Rather, such influences may take many
forms, including the effects of the different forms of social
interactions, environmental contexts, and parental or peer demands
which confront the children while they are enrolled in school. Such
potential caveats do not vitiate the use of the proposed sampling model
since it is appropriate for use in conjunction with designs
incorporating experimental methods which are available for educational
research and for designs concerned with the evaluation of the
influences of educational programs.

A Reconsideration of the Cohort Variable

We have suggested previously that the definition of the cohort
variable need not be restricted to date of birth, as Schaie (1965)
has assumed. Such a definition is most appropriate, perhaps, for
studies concerned with the behavior and development of infants (e.g.,
Weatherford & Cohen, 1973). However, even in these instances, the
definition can be called into question. As an example, Fantz, Fagan,
and Miranda (1975) have suggested that date of conception, rather
than date of birth, is a more appropriate index by which to identify
the "origin" of life. Similarly, studies of genetic influences on
behavior assuredly profit from a definition of cohort based on family
lineage. Baltes and Reinert (1969), Buss (in press), and others have
also provided compelling discussions which question the interpretations
of "cohort" effects drawn from studies adopting Schaie's definition.

Like age, the cohort variable can take many forms having a
biological, sociological, or psychological basis. For example, cohort
can be defined by social or environmental factors which are shared by
a specific segment of society at the same time (e.g., entrance into
school, graduation, etc.) or by a society as a whole (e.g., war,
depression). Matters are made even more complex when it is considered
that many of these events are correlated with CA, time of measurement,
and date of birth. For example, the social state "marriage" is
correlated with age in the general population but nevertheless may
have pronounced behavioral correlates which exist either independently
or in interaction with age.

LONG-TERM DEVELOPMENTAL RESEARCH

Birren (1959) noted the absence of developmental scales which
reflect biological, psychological, or sociological "age" over the
long term, and Wohlwill (1973) has recently reiterated this conclusion.
For this reason chronological age continues to serve as the
predominant criterion for subject selection and matching in
developmental research. It is important to note that the reasons for
using chronological age vary widely across different researchers and
different studies. For example, CA may be used because our society is
"age graded," because CA correlates with biological or psychological
development, etc. Nevertheless, such relationships are not necessarily
stable over the long term (e.g., Neugarten & Moore, 1968).

A second point is that very little developmental research is
concerned with behavior change over a large segment of the life span.
Impediments to life-span research have included the artificial
segmentation of the life span as well as the failure of developmental
theories to encompass a whole-life perspective.

In addition, it has been noted here that developmental researchers
rarely attend to the point of origin as a nominal property of a
[...]-based scale. Rather, development (i.e., behavior change over
time) is examined in relation to the developmentally "youngest" sample
included in the investigation. The suggestion here is that developmental
change is most properly assessed in relation to a sample selected
and defined in terms of process-defined criteria directly related to
the theory or hypotheses central to the investigation. Thus, the
segment of the life span which is sampled in a developmental study may
be restricted to the period over which the process is assumed to
influence behavior. Another implication is that the construction and
use of developmental scales based on process-related criteria need not
encompass the life span unless the process itself is assumed to be of
central importance across this period. Long-term developmental changes
in behavior may not be properly represented using a single developmental
scale. More important for present purposes, however, is that
shorter-term changes may be efficiently described through the selection
of a scale defined by a functional point of origin and a metric of time
in the manner illustrated in Tables 2-5.

SUMMARY

Chapters 2 and 3 have highlighted the methodological complexities
involved in the conduct of research concerned with studying B = f(T)
phenomena. The attempts to resolve the complexities through the use
of sequential sampling strategies such as those provided by Schaie
(1965) and Baltes (1968) must be viewed as very significant
advancements. However, it has been shown that the use of a sequential
design (as a replacement for the longitudinal, cross-sectional, or
time-lag design) is no panacea unless the hypotheses guiding the study
of the B = f(T) phenomena of interest are firmly grounded in theory.
Furthermore, the theory guiding the investigation should specify the
underlying scale along which the B = f(T) phenomena change and the
major factors (e.g., age, time-of-measurement, or cohort) influencing
behavior and performance for the time period, social context, and
population being studied. The theory should also provide strong
direction to the researcher in selecting the times of testing and the
ages of children from which to collect data. Finally, the theory must
specify the relation between the factors of age, time-of-measurement,
and cohort. It is only when this is accomplished that a sampling model
conforming to Schaie's general developmental model, or one of the
Schaie (1965) and Baltes (1968) strategies, can be selected as the
optimal sampling strategy for the behaviors being studied. The
controversy between Schaie (1965) and Baltes (1968) as to whether
Schaie's model conforms to a trifactor or bifactor model is a case in
point which can only be settled in the context of a theory which
speaks directly to these issues and those discussed in this section.


REFERENCES 

Anderson, B., & Mader, H. 34,000 pupils and how they grow. 
Journal of Teacher Education , 1?8?^. ' 

Austin, G. R., Rogers, B. G., & Walbesser, H. H., Jr. The
     effectiveness of summer compensatory education: A review of the
     research. Review of Educational Research, 1972, 42, 171-182.

Baltes, P. B. Longitudinal and cross-sectional sequences in the
     study of age and generation effects. Human Development, 1968,
     11, 145-171.

Baltes, P. B., & Goulet, L. R. Exploration of developmental
     variables by manipulation and simulation of age differences in
     behavior. Human Development, 1971, 14, 149-170.

Baltes, P. B., & Reinert, G. Cohort effects in cognitive development
     of children as revealed by cross-sectional sequences.
     Developmental Psychology, 1969, 1, 169-177.

Birren, J. E. The psychology of aging. New York: Prentice-Hall,
     1964.

Campbell, D. T., & Stanley, J. C. Experimental and quasi-experimental
     designs for research on teaching. In N. L. Gage (Ed.), Handbook
     of research on teaching. Chicago: Rand McNally, 1963, 171-246.

Coleman, J. S., et al. Equality of educational opportunity. Catalog
     No. FS 5.238:38000. Educational Testing Service Annual Report.
     Princeton, N. J.: Educational Testing Service, 1964.

Fantz, R. L., Fagan, J. F., & Miranda, S. B. Early visual selectivity
     as a function of pattern variables, previous exposure, age from
     birth and conception, and expected cognitive deficit. In
     L. Cohen and P. Salapatek (Eds.), Infant Perception. Vol. 1.
     New York: Academic Press, 1975.

Gelman, R. Conservation acquisition: A problem of learning to
     attend to relevant attributes. Journal of Experimental Child
     Psychology, 1969, 7, 167-187.

Goulet, L. R. Verbal learning in children: Implications for
     developmental research. Psychological Bulletin, 1968, 69,
     359-376.

Goulet, L. R. Training, transfer, and the development of complex
     behavior. Human Development, 1970, 13(4), 213-240.






Goulet, L. R., & Goodwin, K. S. Development and choice behavior in
     probabilistic and problem-solving tasks. In H. W. Reese &
     L. P. Lipsitt (Eds.), Advances in child development and
     behavior, V. New York: Academic Press, 1970, 213-254.

Goulet, L. R., Williams, K. G., Bozinou, E., & Hexner, P. Z.
     Longitudinal and time-lag differences in rule utilization:
     Schooling and age-related effects. Unpublished manuscript, 1973.

Goulet, L. R., Williams, K. G., & Hay, C. M. Age- and
     schooling-related changes in memory performance. Unpublished
     manuscript, [year illegible].

Goulet, L. R., Williams, K. G., & Hay, C. M. Longitudinal changes
     in intellectual functioning in pre-school children: Schooling
     and age-related effects. Journal of Educational Psychology,
     1974, 66, 657-662.






Hilton, T. L., & Meyers, A. E. Personal background, experience,
     and school achievement: An investigation of the contribution
     of questionnaire data to academic prediction. Journal of
     Educational Measurement, 1967, 4, 69-80.

Hilton, T. L., & Patrick, C. Cross-sectional versus longitudinal
     data: An empirical comparison of mean differences in academic
     growth. Journal of Educational Measurement, 1970, 7, 15-24.

Kessen, W. Research design in the study of developmental problems.
     In P. H. Mussen (Ed.), Handbook of research methods in child
     development. New York: Wiley, 1960, 36-70.

Neugarten, B. L., Moore, J. W., & Lowe, J. G. Age norms, age
     constraints, and adult socialization. American Journal of
     Sociology, 1965, 70, 710-717.

Schaie, K. W. A general model for the study of developmental problems.
     Psychological Bulletin, 1965, 64, 92-107.

Schaie, K. W. Limitations on the generalizability of growth curves
     of intelligence: A reanalysis of some data from the Harvard
     Growth Study. Human Development, 1972, 13, 141-152.

Sigel, I. E., & Hooper, F. H. Logical thinking in children: Research
     based on Piaget's theory. New York: Holt, Rinehart and Winston,
     1968.

Weatherford, M. J., & Cohen, L. B. Developmental changes in infant
     visual preferences for novelty and familiarity. Child
     Development, 1973, 44, 416-424.






Weir, M. Developmental changes in problem-solving strategies.
     Psychological Review, 1964, 71, 473-490.

Wohlwill, J. F. The age variable in psychological research.
     Psychological Review, 1970, 77, 49-64.

Wohlwill, J. The study of behavioral development. New York:
     Academic Press, 1973.

Wood, P. A., & Goulet, L. R. Age and school experience as factors
     related to visual perception. American Educational Research
     Journal, 1973a.

Wood, P. A., & Goulet, L. R. Longitudinal and grade-related
     differences in visual-perceptual performance. Unpublished
     manuscript, 1973.






CHAPTER 4 



THE DETERMINATION OF THE SIGNIFICANCE OF CHANGE

BETWEEN PRE AND POSTTESTING PERIODS

The measurement of change has been a favorite topic of
psychometricians for years. It is a topic with considerable problems,
many of which are best avoided by following the advice of Cronbach and
Furby (1970) to "...investigators who ask questions regarding gain
scores..." that they "...frame their questions in other ways."

In many situations, gain scores appear to be the natural measure to
be obtained. In some instances, however, the formulation of the
questions in terms of gains introduces unnecessary problems. In other
instances the gain formulation gives the illusion that certain types of
inferences can be made when in fact they are not justified. In the
latter case, the gain formulation conceals limitations that are
inherent in the data.

In this chapter some of the major issues that arise in the
measurement of change are reviewed and, where possible, alternative
approaches are discussed. The measurement of individual differences is
considered first. This is followed by a discussion of some of the
concerns involved in inferring treatment effects from group
differences. The chapter is then concluded with a section on
accountability systems based on student achievement.

INDIVIDUAL DIFFERENCES 

Some of the best known problems in the measurement of change arise
in situations where there is an interest in measuring individual
differences. It may be desired to identify individuals who gain
unusually large (or small) amounts so that these individuals may be
given special treatment. In the case of some performance contracts,
individual gain scores have been used as the basis of determining
payment to contractors. In other situations there may be an interest
in identifying the correlates of change. While not involving individual
change scores, as such, correlational uses of change scores are also
considered under the heading of individual differences.

Difference Scores



The most natural measure of change from one point in time to another
is the simple difference score. The dieter quite naturally is interested
in the difference between his pre diet weight and his post diet weight.
It is somewhat ironic that this simple procedure results in a score
with several major defects.



Negative correlation with pretest (e.g., Bereiter, 1963; Thorndike,
1966): A major disadvantage of the simple difference score is that it
typically has a negative correlation with the pretest. The correlation
of a pre measure, X, with the difference between a post measure, Y, and
that pre measure is

$$\rho_{xD} = \frac{\rho_{xy}\,\sigma_y - \sigma_x}{\sqrt{\sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\,\sigma_x\sigma_y}} \qquad (4.1)$$

where D = Y - X, $\sigma_x$ and $\sigma_y$ are the standard deviations
of X and Y respectively, and $\rho_{xy}$ is the correlation between X
and Y. It is clear from an inspection of the numerator in equation 4.1
that the correlation between D and X will be negative unless
$\rho_{xy}\sigma_y$ is greater than $\sigma_x$. Typically,
$\rho_{xy}\sigma_y$ will be smaller than $\sigma_x$ because the
correlation between X and Y must be less than one and the standard
deviations of the pre and post measures are often of relatively similar
magnitude. Although there is a tendency for the correlation to be
negative it is, of course, possible for the correlation to be positive,
but only if the standard deviation of the post measure, Y, is larger
than that of the pre measure, X, and generally substantially so. It
should also be noted that since the two terms in the numerator of
equation 4.1 are of opposite sign, the magnitude of the correlation
will usually be small in absolute value.

An implication of the negative correlation between D and X is that
large positive D's are more likely to be observed for persons with low X
scores, whereas persons with high X scores would have large positive D's
only rarely. Thus, if individuals with high D scores are to be selected,
there will be an overrepresentation of people with low X scores as an
artifact due to the negative correlation between D and X.
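
Equation 4.1 can be checked numerically. The following is a minimal sketch in Python and is not part of the original report; the function name and the illustrative values (a pre-post correlation of .7 and the two sets of standard deviations) are assumptions chosen only to show how the sign of the correlation depends on the relative spread of the two measures.

```python
import math


def pretest_difference_correlation(rho_xy, sd_x, sd_y):
    """Correlation between pretest X and difference D = Y - X (equation 4.1)."""
    numerator = rho_xy * sd_y - sd_x
    denominator = math.sqrt(sd_x ** 2 + sd_y ** 2 - 2 * rho_xy * sd_x * sd_y)
    return numerator / denominator


# Hypothetical case 1: equal spread at pre and post, so the numerator
# rho_xy * sd_y - sd_x is negative and so is the correlation.
equal_sd = pretest_difference_correlation(rho_xy=0.7, sd_x=1.0, sd_y=1.0)

# Hypothetical case 2: posttest spread twice the pretest spread, so
# rho_xy * sd_y exceeds sd_x and the correlation turns positive.
larger_post_sd = pretest_difference_correlation(rho_xy=0.7, sd_x=1.0, sd_y=2.0)

print(round(equal_sd, 3), round(larger_post_sd, 3))
```

With equal standard deviations the result is negative whatever the value of the pre-post correlation (short of 1), which is the artifact described above.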

Low Reliability (e.g., Lord, 1963): Given the standard assumptions
of classical test theory, the reliability of a difference score is

$$\rho_{DD'} = \frac{\rho_{xx'}\,\sigma_x^2 + \rho_{yy'}\,\sigma_y^2 - 2\rho_{xy}\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 - 2\rho_{xy}\,\sigma_x\sigma_y} \qquad (4.2)$$

where $\rho_{xx'}$ and $\sigma_x^2$ are the pretest reliability and
variance, $\rho_{yy'}$ and $\sigma_y^2$ are the posttest reliability
and variance, and $\rho_{xy}$ is the correlation between pre and
posttests. Consider the special case of 4.2 where

$$\sigma_x = \sigma_y \quad \text{and} \quad \rho_{xx'} = \rho_{yy'} = \rho;$$

then $\rho_{DD'}$ can be written

$$\rho_{DD'} = \frac{\rho - \rho_{xy}}{1 - \rho_{xy}}. \qquad (4.3)$$



Although 4.3 applies only for a specialized situation it may be
instructive to consider values of $\rho_{DD'}$ for selected values of
$\rho$ and $\rho_{xy}$. This is done in Table 4-1 and, as can be seen
there, the value of $\rho_{DD'}$ is discouragingly low when
$\rho_{xy}$ is at all large.
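
The entries of Table 4-1 follow directly from equations 4.2 and 4.3. The sketch below, which is an illustration added here and not the report's own computation, evaluates equation 4.2 with equal variances and equal reliabilities; the function and variable names are hypothetical. Cells in which the pre-post correlation exceeds the reliability are skipped, since the formula would then yield a negative value.

```python
def difference_reliability(rel_x, rel_y, rho_xy, sd_x=1.0, sd_y=1.0):
    """Reliability of D = Y - X under classical test theory (equation 4.2)."""
    cov_xy = rho_xy * sd_x * sd_y
    true_variance = rel_x * sd_x ** 2 + rel_y * sd_y ** 2 - 2 * cov_xy
    observed_variance = sd_x ** 2 + sd_y ** 2 - 2 * cov_xy
    return true_variance / observed_variance


reliabilities = [0.7, 0.8, 0.9]          # columns of Table 4-1
correlations = [0.5, 0.6, 0.7, 0.8, 0.9]  # rows of Table 4-1

difference_table = {}
for rho_xy in correlations:
    for rel in reliabilities:
        if rho_xy <= rel:  # skip cells where the formula would be negative
            value = difference_reliability(rel, rel, rho_xy)
            difference_table[(rho_xy, rel)] = round(value, 2)

for rho_xy in correlations:
    row = [difference_table.get((rho_xy, rel), "") for rel in reliabilities]
    print(rho_xy, row)
```

With equal variances and reliabilities the general formula reduces to equation 4.3, so the same function covers both cases.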

Of course, one way to obtain more reliable difference scores is to
have a low correlation between pre and post scores. Under such
circumstances, however, it is questionable that the pre and post
measures are getting at the same construct, which would seem to be a
prerequisite for the difference score to be interpreted as an index of
growth.

An implication of the low reliability of difference scores is that it
is quite risky to make any important decisions about individuals on the
basis of gains from pre to post testing periods. A practical situation
where the low reliability of a difference score causes problems is that
of performance contracting. Even without any real change it is possible
to find substantial numbers of individuals with large difference scores
due simply to the low reliability of these scores. Stake (1971) has
illustrated this problem. He concluded that owing to unreliability,
gain scores can appear to reflect "learning that actually does not
occur" (p. 587).
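
The point can be illustrated with a small simulation. This is a hedged sketch added for illustration, not Stake's (1971) demonstration: it assumes classical test theory with normally distributed true scores and errors, a reliability of .8 at both testings, unit observed variance, and no true change at all. The sample size, seed, and the half-standard-deviation cutoff for a "large" gain are arbitrary assumptions.

```python
import random


def simulate_gains(n_pupils=10000, reliability=0.8, seed=0):
    """Simulate observed gains when every pupil's true score is unchanged."""
    rng = random.Random(seed)
    true_sd = reliability ** 0.5         # true-score SD (observed SD is 1)
    error_sd = (1 - reliability) ** 0.5  # error SD at each testing
    gains = []
    for _ in range(n_pupils):
        true_score = rng.gauss(0, true_sd)  # identical at pre and post
        pre = true_score + rng.gauss(0, error_sd)
        post = true_score + rng.gauss(0, error_sd)
        gains.append(post - pre)
    return gains


gains = simulate_gains()
mean_gain = sum(gains) / len(gains)
# Share of pupils whose observed gain is at least half an observed SD.
share_large_gain = sum(g >= 0.5 for g in gains) / len(gains)
print(round(mean_gain, 3), round(share_large_gain, 3))
```

Under these assumptions roughly one pupil in five shows an apparent gain of half a standard deviation or more, although the average gain is essentially zero and no pupil's true score changed.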


Lack of Common Trait and Scale (e.g., Bereiter, 1963; also Chapters
5 and 7 of this report): It would hardly be sensible to estimate a
person's gain in weight by subtracting the number of pounds he weighed
at time 1 from the number of ounces he weighed at time 2. To make sense
the same scale units must be used at both points in time. Similarly,
it would make no sense to subtract a pre measure of height from a post
measure of weight to get an estimate of weight gain. It is necessary to
measure weight at both points in time.

The need for a common scale and trait at pre and posttesting periods,
which is so obvious with the above physical examples, is sometimes less
obvious, but no less essential, in an educational context. For example,
if arithmetic test A was used as the pre measure and arithmetic test B
as the post measure it might be forgotten that the units of the two tests
are unequal. Even more likely, it might be forgotten that test A consists
primarily of addition problems while test B consists largely of subtraction
problems. Under such conditions the difference scores would not necessarily
be measuring gains along the dimension measured by test A any more than
the difference scores in the two examples involving weight measure weight
gain. Even where the same test (or parallel forms) is used as the pre
and post measures it is sometimes the case that different constructs
are measured at the two points in time. For example, an item which
measures problem-solving skill at one point in time may measure memory
at a later point in time.

Table 4-1

Difference Score Reliability as a Function of the
Reliability of the Parts and Their Intercorrelation

                     Reliability of Pre and of Post Scores
Correlation                  (assumed to be equal)
of Pre and
Post Scores              .7          .8          .9

    .5                  .40         .60         .80
    .6                  .25         .50         .75
    .7                  .00         .33         .67
    .8                              .00         .50
    .9                                          .00

Assuming $\rho_{xx'} = \rho_{yy'}$ and $\sigma_x = \sigma_y$.

Residual Scores

Problems inherent to difference scores have led a number of people
to seek alternatives. One of these is the residual score, which is
largely motivated by the desire for a score that has a zero correlation
with the pretest (DuBois, 1957; Manning & DuBois, 1962). As noted by
Cronbach and Furby (1970), "One cannot argue that the residualized score
is a 'corrected' measure of gain..."; rather it "...is primarily a way
of singling out individuals who have gained more (or less) than
expected" (p. 74).

A residual score, R, is obtained by subtracting the predicted
posttest score, $\hat{Y}$, from the corresponding observed posttest
score, Y. The predicted posttest score is obtained from the linear
regression of Y on the pretest, X. The zero correlation between X and R
follows immediately from the way in which R is derived and is seen as a
major advantage over difference scores because residuals do not give an
advantage to persons with certain values of the pretest scores, whereas
difference scores do.
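
The construction of residual scores can be sketched as follows. This is a minimal illustration written for this passage, using ordinary least squares for the regression of Y on X; the pretest and posttest values are invented for the example and the function names are hypothetical.

```python
import math


def residual_scores(pre, post):
    """Residuals R = Y - Y_hat from the least-squares regression of Y on X."""
    n = len(pre)
    mean_x = sum(pre) / n
    mean_y = sum(post) / n
    sxx = sum((x - mean_x) ** 2 for x in pre)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


def correlation(a, b):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Invented pretest and posttest scores for six pupils.
pre = [10, 12, 15, 18, 20, 25]
post = [14, 13, 19, 20, 26, 27]

residuals = residual_scores(pre, post)
print(correlation(pre, residuals))  # zero apart from rounding error
```

The zero correlation is a property of least-squares residuals in the sample at hand, which is why it "follows immediately" from the derivation and does not depend on the particular scores used.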

While solving the problem caused by the correlation between
difference and pretest scores, residuals, like difference scores, tend
to be unreliable. As indicated by O'Connor (1972), the reliability of a
residual score can be written as

$$\rho_{RR'} = \frac{\rho_{yy'} - \rho_{xy}^2\,(2 - \rho_{xx'})}{1 - \rho_{xy}^2}.$$

Values of the reliability of residual scores are reported in Table 4-2
for selected values of $\rho$ and $\rho_{xy}$ under the assumption that
$\rho_{xx'} = \rho_{yy'} = \rho$. The values of $\rho$ and $\rho_{xy}$
used in Table 4-2 are the same as those used in Table 4-1.




Although the residual score reliabilities shown in Table 4-2 are
somewhat better than the corresponding difference score reliabilities
shown in Table 4-1, they are still disappointingly small whenever the
correlation of pre and post scores is large. Furthermore, residuals
are usually of most interest in situations where the pre-post correlation
is large relative to the reliabilities of the parts. Thus, the same
cautions due to unreliability of difference scores also apply to
residual scores.
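
The comparison between the two tables can be reproduced from the formulas. The sketch below is an added illustration; it assumes the residual-score reliability takes the form (rel_y - rho_xy^2 (2 - rel_x)) / (1 - rho_xy^2), which is the form that agrees with the entries of Table 4-2, and it uses equation 4.3 for the difference score. Function names are hypothetical.

```python
def residual_reliability(rel_x, rel_y, rho_xy):
    """Reliability of a residual score (form consistent with Table 4-2)."""
    return (rel_y - rho_xy ** 2 * (2 - rel_x)) / (1 - rho_xy ** 2)


def difference_reliability_equal(rel, rho_xy):
    """Reliability of a difference score in the special case of equation 4.3."""
    return (rel - rho_xy) / (1 - rho_xy)


reliabilities = [0.7, 0.8, 0.9]
correlations = [0.5, 0.6, 0.7, 0.8, 0.9]

comparison = {}
for rho_xy in correlations:
    for rel in reliabilities:
        residual = residual_reliability(rel, rel, rho_xy)
        if residual >= 0:  # negative values are left blank in Table 4-2
            difference = difference_reliability_equal(rel, rho_xy)
            comparison[(rho_xy, rel)] = (round(residual, 2), round(difference, 2))

for key in sorted(comparison):
    print(key, comparison[key])
```

In every cell the residual reliability is at least as large as the difference reliability, but both shrink toward zero as the pre-post correlation approaches the reliability of the parts.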

Table 4-2

Residual Score Reliability as a Function of the
Reliability of the Parts and Their Intercorrelations

                     Reliability of Pre and of Post Scores
Correlation                  (assumed to be equal)
of Pre and
Post Scores              .7          .8          .9

    .5                  .50         .67         .83
    .6                  .36         .58         .79
    .7                  .12         .42         .71
    .8                              .09         .54
    .9                                          .05

Assuming $\rho_{xx'} = \rho_{yy'}$.

Estimated True Change

Another alternative to the raw difference score approach is to
estimate "true" change. In other words, the change that would be
obtained if there were no errors of measurement is estimated. The
true change is presumably the quantity of real interest whenever an
attempt is made to measure change.

In the case of a single measure there is a perfect correlation
between the estimated true score for that measure and the observed
score. Hence, for most purposes the observed score serves just as
well as the estimated true score. Whenever two or more measures are
available, however, the estimated true score based on all available
information will ordinarily have a less than perfect correlation with
the observed score of the measure. In the case of a difference
score both the pretest and the posttest, and if available, other
scores as well, provide information about the true difference score,
and the resulting estimated true score may result in a noticeably
different ranking of individuals than would be obtained from raw
difference scores.

Regression Estimates (Lord, 1956, 1958, 1963; McNemar, 1958; Cronbach
& Furby, 1970; Marks & Martin, 1973): Given estimates of the reliabili-
ties of the pretest and of the posttest as well as their variances and
their covariance, it is possible to obtain estimates of true gain using
multiple regression procedures. The basic formulas can be found in Lord
(1963, p. 28). Cronbach and Furby (1970) extend these formulas by dis-
tinguishing between linked measures (i.e., ones with correlated errors)
and independent measures. They also consider the possibility of using other
available measures as predictors.

As Lord (1963) has shown with an empirical example, persons with the
largest estimated true difference scores are not necessarily those with
the largest observed difference scores. In particular, persons with
relatively large pretest scores are more apt to be among those with "large"
gains when estimated true difference scores (Lord, 1963, equation 3) are
used than when raw difference scores are used. Thus, the estimated true
difference scores obviate the objection that difference scores tend to
favor persons with low pretest scores.

As noted by Cronbach and Furby (1970), it is not necessary to limit
the estimation to the measures involved in the difference score. Any
measures that are available may be used along with the pre and the post
measures to estimate the true difference score. As shown by Tatsuoka
(1975), the additional measures will improve the prediction of the true
difference if they are correlated with the errors of measurement on X
and/or Y. In practice, the addition of more predictor variables would
probably improve the accuracy of the estimate relatively little unless
the pre and post measures were of low reliability.

The reliability of estimated true change is equal to the squared
multiple correlation of true change with the predictor variables, i.e.,
with X, Y and possibly other measures. It will always be as large as or
larger than the reliability of a simple difference score (Tatsuoka, 1975).
When $\rho_{xx'} = \rho_{yy'}$ and $\sigma_x = \sigma_y$ the reliability
of the estimated true difference scores equals that of the raw
difference scores (see equation 4-3). If the pre and posttest
reliabilities and/or the pre and posttest variances are unequal then
the reliability of the estimated true difference scores will exceed
that of raw difference scores but typically only slightly.
For example, if $\rho_{xx'}$ = [illegible], $\rho_{yy'}$ = [illegible],
$\sigma_x$ = [illegible], $\sigma_y$ = [illegible] and $\rho_{xy}$ =
[illegible], then the reliability of the raw difference score computed
from (2) is .600, which can be compared to the reliability of the
estimated true difference score of .613.
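
The relation between the two reliabilities can be illustrated with a small computation. This is a minimal sketch, assuming classical test theory with uncorrelated errors of measurement and only X and Y as predictors; the function names and the numerical values in the usage lines are hypothetical and are not the values used in the report's own example.

```python
def raw_difference_reliability(rel_x, rel_y, sd_x, sd_y, r_xy):
    """Reliability of the raw difference Y - X (true variance / observed variance)."""
    cov_xy = r_xy * sd_x * sd_y
    true_var = rel_x * sd_x ** 2 + rel_y * sd_y ** 2 - 2.0 * cov_xy
    obs_var = sd_x ** 2 + sd_y ** 2 - 2.0 * cov_xy
    return true_var / obs_var


def estimated_true_change_reliability(rel_x, rel_y, sd_x, sd_y, r_xy):
    """Squared multiple correlation of true change with X and Y.

    Assumes errors on X and Y are uncorrelated with each other and
    with all true scores (classical test theory).
    """
    var_x, var_y = sd_x ** 2, sd_y ** 2
    cov_xy = r_xy * sd_x * sd_y
    # Covariances of each observed score with true change (T_y - T_x).
    c_x = cov_xy - rel_x * var_x
    c_y = rel_y * var_y - cov_xy
    # Regression weights from the 2 x 2 normal equations.
    det = var_x * var_y - cov_xy ** 2
    b_x = (var_y * c_x - cov_xy * c_y) / det
    b_y = (var_x * c_y - cov_xy * c_x) / det
    explained = b_x * c_x + b_y * c_y
    true_change_var = rel_x * var_x + rel_y * var_y - 2.0 * cov_xy
    return explained / true_change_var


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    args = (0.7, 0.9, 1.0, 1.5, 0.5)
    print(raw_difference_reliability(*args))
    print(estimated_true_change_reliability(*args))
```

With equal reliabilities and equal standard deviations the two functions return the same value, which matches the statement in the text.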

Linked vs. Independent Observations (Cronbach & Furby, 1970; Werts,
Jöreskog & Linn, 1972): All of the preceding discussion depends on the
usual assumptions of classical test theory. In particular, it is implicitly
assumed that the pretest errors of measurement are uncorrelated with the
posttest errors of measurement. Where the same instrument is used to
obtain both pre and post measures the assumption of uncorrelated errors
of measurement may be especially dubious. Thus, it is desirable to use
estimation procedures that allow for the possibility of correlated errors
of measurement on the pre and posttest. To do so, however, requires the
availability of more information in the form of multiple measures than
is often available in practical settings.

Cronbach and Furby (1970) have formalized the distinction between
linked and independent observations. They distinguish two types of
error components. For linked observations (e.g., the same form of a
test used as both the pre and the posttest) one type of error component
would be assumed to have a nonzero correlation. On the other hand,
independent observations would be assumed to have both types of error
components uncorrelated. The distinction between linked and independent
observations leads to different formulas for estimating the
reliabilities of difference scores and true change. Basically the
formulas require that a distinction be made between the correlation of
X and Y where X and Y are linked and where X and Y are independent
observations. Furthermore, separate
estimates of the linked and the independent observations' $\rho$
[subscript illegible] are required.

Correlates of Change 

Frequently the focus in measuring change is not on the individual
difference scores but on their correlates. The interest is in finding
variables that predict the amount of change. Measures of change may
sometimes be computed for individuals as a means to the end of correlating
these measures with other variables. Frequently, however, the change
measures need not actually be computed to obtain the desired correlations
of these measures with other variables.

The alternative approaches to measuring change result in different
correlations of these measures with other variables. The different
estimates have different theoretical and practical implications.

Spurious Correlations (Lord, 1963): Earlier the tendency for a differ-
ence score to have a negative correlation with the pretest was noted.
More generally, the correlation of a raw difference score with another
variable that is partially a function of the pretest or posttest is
usually considered spurious (Lord, 1963, p. 33). The spuriousness is
the result of the same errors of measurement occurring in the difference
score and in the variable with which it is correlated. In the case of
the correlation of D = Y - X with X, the same errors of measurement that
are positively weighted for X are negatively weighted for D and the
result is usually a spurious negative correlation.

Attenuation (Lord, 1963): Unreliability has the effect of attenuating
correlations. This is true of all fallible measures but becomes of major
importance when the reliability of a variable is quite low, as is typically
the case for measures of change. The practical implication of the large
degree of attenuation that is typically encountered with difference scores
is that correlations involving a difference score will tend to be quite
low, which is rather discouraging for someone who is interested in finding
correlates of change.
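
The size of the attenuation can be sketched with the standard classical test theory relation between observed and true-score correlations. This is a minimal sketch; the function names and the example values are illustrative assumptions, not figures from the report.

```python
import math


def attenuated_correlation(true_r, rel_a, rel_b):
    """Observed correlation implied by a true-score correlation.

    rel_a and rel_b are the reliabilities of the two fallible measures.
    """
    return true_r * math.sqrt(rel_a * rel_b)


def disattenuated_correlation(observed_r, rel_a, rel_b):
    """Classical correction for attenuation (inverse of the function above)."""
    return observed_r / math.sqrt(rel_a * rel_b)


if __name__ == "__main__":
    # Hypothetical case: a difference score with reliability .25 and a
    # perfectly reliable correlate whose true correlation with change is .50.
    print(attenuated_correlation(0.50, 0.25, 1.0))
```

In this hypothetical case a true correlation of .50 would appear as an observed correlation of only .25, which illustrates why correlates of change are hard to find with unreliable difference scores.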


Part and Partial Correlations: If residual scores rather than
difference scores are used in correlational studies, the result is the
same as a part correlation. That is, the pretest score, X, is partialled
out of the posttest score, Y, and the residual is correlated with a third
variable, W. Note that X is not partialled out of W but only out of Y.
The result is called a part correlation. Thus, X is held constant statisti-
cally with respect to Y but not with respect to W.

A more familiar correlational approach is to partial X out of both
Y and W. The result is called a partial correlation and has a somewhat
simpler interpretation than the part correlation, since X is held constant
statistically with respect to both Y and W instead of for just one of them
as in the case of part correlation. If X, Y and W have a multivariate
normal distribution, then the partial correlation of Y and W with X
partialled out is simply equal to the correlation between W and Y for
any fixed value of X. This would often seem to be a coefficient of
interest where the focus is on correlates of change from pre to posttesting
period. As previously noted, however, residual scores cannot be consid-
ered as better measures of change. They merely represent that part of a
score that is not linearly predictable from the variable that is partialled
out. Nonetheless, the partial correlation provides a means of identifying
variables that can predict posttest scores of individuals with equal
pretest scores.
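
The difference between the two coefficients can be made concrete with the standard formulas expressed in terms of the three zero-order correlations. This is a minimal sketch; the function names and the example correlations are illustrative assumptions.

```python
import math


def part_correlation(r_wy, r_wx, r_xy):
    """Correlation of W with the residual of Y on X (X removed from Y only)."""
    return (r_wy - r_wx * r_xy) / math.sqrt(1.0 - r_xy ** 2)


def partial_correlation(r_wy, r_wx, r_xy):
    """Correlation of W and Y with X removed from both."""
    return (r_wy - r_wx * r_xy) / math.sqrt((1.0 - r_wx ** 2) * (1.0 - r_xy ** 2))


if __name__ == "__main__":
    # Hypothetical correlations among W, pretest X and posttest Y.
    r_wy, r_wx, r_xy = 0.5, 0.4, 0.7
    print(part_correlation(r_wy, r_wx, r_xy))
    print(partial_correlation(r_wy, r_wx, r_xy))
```

The two coefficients share a numerator and differ only in the denominator, so the partial correlation is never smaller in absolute value than the part correlation, and the two coincide when W is uncorrelated with X.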

The problem of unreliability that runs throughout the measurement of
change is also a major concern with partial correlation. The direction
of the effect of unreliability on a simple correlation is known in advance.
Unfortunately, this is not true of partial correlations (Lord, 1963, p. 36).
In the case of partial correlations, the effect of errors of measurement
may be to change the sign of a partial correlation. As shown by Linn and
Werts (1973), it is possible for errors of measurement to result in a
partial correlation of zero where the partial correlation among the error
free measures is non-zero. For these reasons, it is particularly important
to make corrections for attenuation when using partial correlations.

Partial Regression Weights: Within the context of a linear model,
the relationship of a variable, W, with change might be evaluated in terms
of the regression of the change on W and the pretest. Werts and Linn (1970)
have shown that the resulting partial regression weights can be readily
obtained from the partial regression coefficients in the regression of
the posttest on W and X. Hence, there is no need to actually use difference
scores. This is true with or without corrections for unreliability of the
measures.
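
The equivalence described above can be checked numerically for the uncorrected case. This is a minimal sketch using ordinary least squares with two predictors; the helper name and the small data set are hypothetical and serve only to show that the weight for W is the same in both regressions while the weight for X is reduced by one.

```python
def ols_two_predictors(y, w, x):
    """Least-squares slopes of y on w and x (intercept included implicitly)."""
    n = len(y)
    mean_y, mean_w, mean_x = sum(y) / n, sum(w) / n, sum(x) / n
    s_ww = sum((a - mean_w) ** 2 for a in w)
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_wx = sum((a - mean_w) * (b - mean_x) for a, b in zip(w, x))
    s_wy = sum((a - mean_w) * (b - mean_y) for a, b in zip(w, y))
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    det = s_ww * s_xx - s_wx ** 2
    b_w = (s_xx * s_wy - s_wx * s_xy) / det
    b_x = (s_ww * s_xy - s_wx * s_wy) / det
    return b_w, b_x


# Hypothetical pretest, posttest and third-variable scores.
pre = [10, 12, 9, 15, 11, 14]
post = [13, 14, 10, 19, 12, 18]
third = [3, 5, 2, 6, 2, 5]
gain = [b - a for a, b in zip(pre, post)]

post_b_w, post_b_x = ols_two_predictors(post, third, pre)
gain_b_w, gain_b_x = ols_two_predictors(gain, third, pre)

if __name__ == "__main__":
    print(post_b_w, post_b_x)
    print(gain_b_w, gain_b_x)
```

Because the gain is the posttest minus the pretest, regressing the gain on W and X simply subtracts one from the weight for X and leaves the weight for W unchanged.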

Recommendations (Individual Differences)

One of the most common uses of change measures is as criteria in
correlational studies. The goal of such studies is the identification
of variables that predict who will gain the most in a particular situation.







Cronbach and Furby (1970) argue that it is preferable to phrase such
questions in terms of partial correlations rather than correlations
involving difference scores or in terms of part correlations. We concur
with this recommendation. Regardless of the way in which such questions
are phrased, however, it is important to take the unreliability of the
measures into account.

In the case of partial correlations, taking the unreliability into
account "...poses somewhat of a dilemma, since, first, it is often hard
to obtain the particular kind of reliability coefficients that are
required for making the appropriate correction, and, further, the partial
corrected for attenuation may be seriously affected by sampling errors.
These obstacles can hardly justify the use of an uncorrected coefficient
that may have the wrong sign, however (Lord, 1963, p. 36)."

Two other possible uses of change measures relating to individual
differences that are discussed by Cronbach and Furby (1970) are the
identification of individuals with unusually large (or small) gains and
the use of change measures as theoretical constructs. In neither case
are change scores needed. In the former case the regression approach
outlined by Cronbach and Furby is preferred. In the latter case, linear
combinations other than simple difference scores, with the arbitrary
weights of plus and minus one, should be allowed (Cronbach and Furby,
1970).

GROUP DIFFERENCES (INFERRING TREATMENT EFFECTS) 

Questions about the effects of experimental treatments or of variables
involved in observational studies are frequently phrased in terms of gains.
For example, does treatment A result in a larger gain than treatment B?
Do students in integrated schools gain more than students in segregated
schools? Do students in "open" classrooms gain more than those in "tradi-
tional" classrooms? Although these questions seem intuitively reasonable,
it does not follow that the best approach to trying to answer them
involves the use of measures of change as dependent variables. Indeed,
"...There appears to be no need to use measures of change as dependent
variables and no virtue in using them (Cronbach and Furby, 1970, p. 78)."

An important distinction among investigations aimed at inferring
treatment effects must be made between studies that have random assignment
and those that don't. For studies with random assignment a pretest serves
primarily as a means of increasing statistical power. Where treatment
groups are not formed by random assignment it is often hoped that the
pretest will provide a means of allowing for preexisting differences.

Random Assignment

When treatment groups are formed by randomly assigning individuals
(or more generally units) to treatment conditions, the posttest alone is
perfectly suitable as a dependent variable. A test of the null hypothesis
of equal posttest means for the treatment groups is appropriate for eval-
uating treatment effects. If pretest measures are available in this
context their potential usefulness is best evaluated in terms of the
effect of each use on the power of the statistical test.






A pretest may increase the precision of an experiment. The extent to
which experimental precision is improved depends on the way in which the
pretest information is used as well as on the nature and magnitude of the
pretest-posttest relationship. Difference scores are one possibility but
not the only one. Feldt (1958) compared three potential uses of concomitant
variables: (1) blocking, (2) analysis of variance on difference scores,
and (3) analysis of covariance. He clearly shows that among these three
approaches the difference score approach has the least precision.
Thus, on the basis of precision the choice would ordinarily be between
blocking and the analysis of covariance, with the analysis of covariance
being the most precise where the correlation between the pre and posttest
is greater than .6 (Feldt, 1958).
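
A rough sense of why difference scores are the least precise of these options can be had from the error variances involved. The following is a simplified sketch, not Feldt's (1958) analysis: it assumes equal pretest and posttest variances, a common within-group correlation r, and large samples (so the degree of freedom lost to estimating the slope is ignored), and it does not model blocking at all. The function name is illustrative.

```python
def relative_error_variances(r):
    """Error variance of each analysis relative to a posttest-only analysis.

    Assumes equal pre and post variances and within-group correlation r.
    """
    return {
        "posttest only": 1.0,
        "difference scores": 2.0 * (1.0 - r),
        "covariance adjustment": 1.0 - r ** 2,
    }


if __name__ == "__main__":
    for r in (0.3, 0.5, 0.7, 0.9):
        ratios = relative_error_variances(r)
        print(r, {name: round(value, 2) for name, value in ratios.items()})
```

Under these assumptions the covariance adjustment never has a larger error variance than the difference score analysis, and difference scores are actually worse than ignoring the pretest altogether when r is below .5.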

In pretest-posttest designs the correlation between the pretest and
the posttest is frequently .7 or higher. Thus, the analysis of covariance
would seem to be an attractive approach to the analysis of such data.
Before this technique is wholeheartedly accepted, however, several limita-
tions of the technique need to be considered. As Elashoff (1969) has
argued, the analysis of covariance is a "delicate instrument". Elashoff
notes that the analysis of covariance involves a number of relatively
strong assumptions and violations of some of these assumptions may invali-
date the technique. Where the assumptions of linearity or of homogeneity
of regression seem questionable it may be preferable to use the pretest
as a blocking variable rather than a covariate. In any event, however,
there seems to be little justification for using difference scores.

Another assumption of the analysis of covariance is that the covariate
is measured without error. Violations of this assumption are most trouble-
some in situations where groups are not formed by random assignment and
will be considered again in that context. Even with random assignment,
errors of measurement limit the value of traditional analysis of covariance.
But techniques are available for allowing for errors of measurement in
the covariate (Lord, 1960; Porter, 1967).



Preformed Groups 

Random assignment is seemingly impossible in many situations where
answers to questions about treatment effects are sought. Children cannot
ordinarily be randomly assigned to schools or to major programs such as
Head Start. Even if such random assignment were administratively feasible,
it might not be desirable on grounds other than the desire for a clean
experimental design. Without random assignment it is, of course, possible
that differences that may be observed in the posttest score are the result
of preexisting group differences rather than treatment effects. What is
desired is a means of allowing for preexisting group differences. It is
the hope of achieving this goal that often leads to the use of difference
scores or the analysis of covariance.



Lord (1967, 1968) has provided a compelling analysis of the use of
difference scores or the analysis of covariance to infer treatment effects
from studies involving preformed groups. He has clearly shown that the
two approaches can lead to contradictory results. The basic problem is
one of making the proper adjustment for any preexisting differences.
Unfortunately, there is no way of knowing which of these or any other
techniques provides the proper adjustments. According to Lord, "...there
simply is no logical or statistical procedure that can be counted on to
make proper allowances for uncontrolled preexisting differences between
groups (1967, p. 305)."

This discouraging conclusion is also reached by Meehl (1970) and by
Cronbach and Furby (1970) among others. Without assurance of proper
adjustments for preexisting differences, there is necessarily a concern
about the possibility that treatment effects, however obtained, may be
subject to major sources of bias. In order to evaluate the bias in
various nonexperimental research situations it is important to have a
clear understanding of what is meant by a treatment effect. Rubin (1972)
has provided a definition which is useful from a formal point of view
as well as being consistent with intuitive notions of a "causal effect."
His basic definition of an effect is specific to each unit (e.g., individ-
ual student, classroom, school) under consideration, to a particular time
interval ($t_1$ to $t_2$) and to a particular pair of treatments (e.g.,
experimental and control). The effect of the experimental versus the control
treatment on a dependent variable, X, is the difference between the score
on X that would have been obtained by the unit at $t_2$ if the experimental
treatment had been introduced at $t_1$ and the score on X that would have
been obtained by the unit at $t_2$ if the control treatment had been intro-
duced at $t_1$.

In practice it is impossible to measure the effect defined above
for any unit because only one treatment can be introduced at $t_1$ and it is
impossible to return to that time to introduce the other treatment. Nor
is it possible to measure the average effect for all units, for the same
reason. Nonetheless this formulation is useful because under random
assignment of units to treatments, the expected value of the difference
in mean scores on X is equal to the average difference that would be
observed if all units could be observed under both treatment conditions
during the same time interval. Thus, the sense in which the randomized
experiment provides an unbiased estimate of the treatment effect is
clear. Furthermore, a framework is provided for considering factors in
nonexperimental designs that result in biased estimates. In this way it
may sometimes be possible to specify conditions under which estimated
treatment effects may be biased in one direction or the other or to clearly
specify the a priori assumptions that would have to be satisfied for the
estimate to be unbiased.
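
The unbiasedness claim can be verified exactly on a toy example by enumerating every possible random assignment. This is a minimal sketch under stated assumptions: both potential scores for each unit are treated as known (which is never the case in practice), the scores are hypothetical, and the function names are illustrative rather than Rubin's notation.

```python
from fractions import Fraction
from itertools import combinations


def average_unit_effect(y_exp, y_ctl):
    """Mean over units of (score under experimental - score under control)."""
    return Fraction(sum(y_exp) - sum(y_ctl), len(y_exp))


def expected_mean_difference(y_exp, y_ctl, n_treated):
    """Average of the observed difference in group means over all equally
    likely assignments of n_treated units to the experimental treatment."""
    units = range(len(y_exp))
    total = Fraction(0)
    count = 0
    for treated in combinations(units, n_treated):
        treated_set = set(treated)
        control = [u for u in units if u not in treated_set]
        mean_t = Fraction(sum(y_exp[u] for u in treated), n_treated)
        mean_c = Fraction(sum(y_ctl[u] for u in control), len(control))
        total += mean_t - mean_c
        count += 1
    return total / count


# Hypothetical potential scores at t2 for six units.
y_experimental = [12, 15, 9, 20, 14, 11]
y_control = [10, 15, 8, 14, 13, 12]

if __name__ == "__main__":
    print(average_unit_effect(y_experimental, y_control))
    print(expected_mean_difference(y_experimental, y_control, 3))
```

Exact fractions are used so that the equality between the two quantities is not obscured by rounding.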


One of the many potential sources of bias in estimated treatment
effects from the analysis of covariance is due to errors of measurement
in the pretest (Porter, 1967; Werts and Linn, 1971). The effect of un-
reliability in the covariate is a reduction in the slope of the regression
of the dependent variable on the covariate. Where there are preexisting
differences in the group means on the covariate the reduction in slope
leads to bias in the estimated magnitude of the treatment effects.
The direction of the bias due to unreliability of the covariate can
be determined and, if adequate estimates of the covariate reliability
can be obtained, the procedures outlined by Porter can be used.
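
The mechanism of this bias can be sketched with group means alone. This is a simplified illustration, not Porter's (1967) procedure, which the text does not reproduce: it assumes that the observed within-group slope equals the true-score slope multiplied by the covariate reliability, and it "corrects" the slope by dividing by that reliability. All names and numbers are hypothetical.

```python
def adjusted_effect(post_mean_1, post_mean_2, pre_mean_1, pre_mean_2, slope):
    """Covariance-adjusted difference in posttest means between two groups."""
    return (post_mean_1 - post_mean_2) - slope * (pre_mean_1 - pre_mean_2)


def attenuated_slope(true_slope, covariate_reliability):
    """Slope on the fallible covariate implied by the true-score slope."""
    return true_slope * covariate_reliability


if __name__ == "__main__":
    # Hypothetical preformed groups that differ by 5 points on the pretest
    # and by 5 points on the posttest, with no treatment effect at all.
    true_slope, reliability = 1.0, 0.8
    observed_slope = attenuated_slope(true_slope, reliability)
    print(adjusted_effect(55.0, 50.0, 25.0, 20.0, observed_slope))
    print(adjusted_effect(55.0, 50.0, 25.0, 20.0, observed_slope / reliability))
```

In this hypothetical case the attenuated slope under-adjusts for the preexisting difference and leaves an apparent effect in favor of the initially higher group, while the corrected slope removes it.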

Single Group Designs 

For a single group such as a school or classroom there may be an
interest in the amount of change that occurs during a given time inter-
val. Once again there is no real virtue to difference scores (Cronbach
and Furby, 1970). A simple t-test for dependent samples will provide a
test of the null hypothesis that the mean pretest score equals the
mean posttest score, i.e., that the mean difference is zero.
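
The test statistic for dependent samples can be computed directly from the paired scores. This is a minimal sketch using only the standard library; the function name and the five pairs of scores are hypothetical, and the resulting statistic would be referred to a t distribution with the returned degrees of freedom.

```python
import math
import statistics


def dependent_samples_t(pre, post):
    """t statistic and degrees of freedom for paired pre and post scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    t_value = mean_diff / (sd_diff / math.sqrt(n))
    return t_value, n - 1


if __name__ == "__main__":
    pre_scores = [10, 12, 9, 15, 11]
    post_scores = [12, 13, 11, 18, 12]
    print(dependent_samples_t(pre_scores, post_scores))
```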

While such differences may be due to the school experience they
might also be due to a host of non-school experiences that students have
during the interval between the pre and posttests. An observed differ-
ence may be attributable to variables associated with increased chron-
ological age which have nothing to do with school effects per se. It
would be desirable to separate differences in test scores that are
associated with chronological age. Goulet (in press) has proposed an
approach that is specifically designed for this purpose.

Goulet (in press) suggested a sampling procedure that would provide
for independent estimates of effects associated with chronological
age and those associated with amount of schooling as well as their in-
teraction. His design would require that nonoverlapping random samples
of students be tested at different points in the school year. The
students' scores would then be categorized according to chronological
age and time of testing. A simple design involving four different
samples of children is shown below.

Age at                 Time of Testing
Testing Date          Sept.        Jan.

    7-3                 A            B

    7-7                 C            D

The means based on subsamples A, B, C and D above provide the basis
for estimating effects associated with schooling that are independent
of effects associated with age. As indicated by Goulet (in press),
the desired estimate is simply



where the X's refer to the subsample means. Goulet's suggested approach
does not guarantee that the estimated effect is due to school. It still
might, for example, be the result of factors outside the school experience
which covary with that experience. It demonstrates, however, an ap-
proach for separating two major sources of competing hypotheses about
clusters of variables that might influence pupil performance. By hold-
ing constant sources of variance associated with age, the estimates of
"school effects" are much more compelling than when the estimates in-
volve a combination of school associated and age associated effects.
A more complete discussion of sampling designs such as the above is
provided in Chapter 2.
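
One way such a design could be analyzed is sketched below. This is an assumption-laden illustration rather than Goulet's own estimator: it assumes that subsamples A and B are the younger children tested in September and January and that C and D are the older children tested in September and January, and it takes the schooling effect to be the average January-minus-September difference within age levels. The function names and the cell means are hypothetical.

```python
def schooling_effect(mean_a, mean_b, mean_c, mean_d):
    """Average later-minus-earlier testing difference with age held constant.

    Assumed layout: A = younger/Sept., B = younger/Jan.,
                    C = older/Sept.,   D = older/Jan.
    """
    return ((mean_b - mean_a) + (mean_d - mean_c)) / 2.0


def age_effect(mean_a, mean_b, mean_c, mean_d):
    """Average older-minus-younger difference with time of testing held constant."""
    return ((mean_c - mean_a) + (mean_d - mean_b)) / 2.0


if __name__ == "__main__":
    # Hypothetical subsample means.
    print(schooling_effect(20.0, 24.0, 23.0, 27.0))
    print(age_effect(20.0, 24.0, 23.0, 27.0))
```

Averaging within age levels keeps the schooling contrast free of the age contrast in this balanced layout, which is the property the design is meant to provide.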

ACCOUNTABILITY SYSTEMS BASED ON STUDENT ACHIEVEMENT 

There may be fairly general agreement with the conclusion stated
by Lord (1967) that there is generally no way of knowing what adjust-
ments should be made to allow for preexisting group differences. None-
theless many practical decisions must be made without the aid of ran-
domized experiments. These decisions must be made on some basis. Even
with all of the pitfalls that are encountered in trying to interpret
information that can be gleaned from data collected for preexisting
groups, it still often seems to be the best alternative.

Responses to pressures to be accountable have taken many forms.
Educational accountability has many meanings and, as Glass (1972)
has indicated, not all of the uses of the term require the measurement
of student performance. One of the more common interpretations, however,
is that educators should be accountable for what students learn. For
this interpretation of accountability the results of standardized
achievement tests would seem a natural source of information not only
for assessing current status but for evaluating progress. Unfortun-
ately, there is great potential for misuse of standardized test results
for purposes of educational accountability.

Norms as Standards

Knowing only a student's raw score on a test would provide essen-
tially no information. To derive meaning the content of the items must
be known in some detail. If the content is described in sufficient
detail then a statement that a student got 20 of 40 items correct would
begin to take on some meaning but would still not be a sufficient basis
for answering a parent's question about whether that was good. There
are two major approaches that are commonly taken to answering this ques-
tion: criterion referenced and norm referenced. The more common of
these is the norm referenced approach, which simply provides a comparison
of the student's performance to some specified group. The norms may
take the form of percentile ranks, grade equivalents or some other type
of scaled score but basically the norms provide a means of interpreting
a student's performance relative to that of other students.

Grade Equivalent Scores: A problem with the use of norms is that the
norm is sometimes confused with the standard or ideal. It is obvious
that not all children can be above the 50th percentile. It should be
just as obvious that not all schools can be above the 50th percentile
of school mean norms. When grade equivalents are used it is still the
case that not all children (or schools) can be above grade level but
this may be less obvious with grade equivalent scores than with some
other types of scales.






The grade equivalent score suffers from a number of defects (see
for example Angoff, 1971). Most of these defects stem from the surplus
meaning that is attached to the label. Because of these defects in the
grade equivalent the latest version of the Standards for Educational and
Psychological Tests recommends that they be discontinued or their use
discouraged (APA, 1974).



Change on Achievement Test Scales: Regardless of the nature of the
scale that is used, scores at a single point in time could hardly be ex-
pected to provide information about the effectiveness of a school. The
notion that educators should be accountable for student learning has
implicit in it the notion of change. True, a measurement at a single
point in time may provide information about strengths or weaknesses but
it cannot be expected to indicate by itself the amount of progress that
was made in any given interval of time. To do this something must be known
about past as well as present performance. The desire to know something
about progress brings us back to our concern about change from pre to
posttesting periods.

Probably the most widely used scale for purposes of evaluating pupil
growth is the grade equivalent scale (see for example, Wargo, et al.,
1972). The deceptive simplicity of grade equivalents makes them appear
particularly useful for the purpose of measuring growth. Lindquist and
Hieronymus, for example, say that "Grade equivalent scores are best
suited for measuring growth from year to year (1964, p. 13)."

Although Lindquist and Hieronymus go on to discuss limitations of
grade equivalent scores, these limitations are often overlooked. One of
the potentially misleading characteristics of grade equivalents is that
they seem to provide a standard of "normal" growth. If educational
accountability is interpreted to mean that someone should be responsible
for the progress or lack of progress displayed by students, then some
notion of satisfactory progress is needed. To many people, the grade
equivalent seems to provide the standard. That is, the gain of one grade
equivalent in a year's time becomes the standard to be expected. Un-
fortunately, however, "...a year's progress in a year's time means dif-
ferent things to a teacher whose class begins the year near or above
grade level and a teacher whose class begins two or three years below
grade level (Rosenshine and McGaw, 1972, p. 640)."

Some of the problems encountered in trying to interpret gains on
standardized achievement scales may be illustrated by the following
example results from a school system. An attempt was made to look at
the gain in achievement test performance for students in three broad
categories of ability as measured by IQ test scores. Standardized
achievement test data were obtained for students in grades 3 and 6. Re-
sults were also obtained for these same students the following year when
they were in grades 4 and 7. Grade scores or grade equivalents were then
reported in reading and in arithmetic at each point in time and gain
scores were computed over the one-year interval. The mean scores and
mean gains were reported separately by school and for students with
IQ's of 114 or above, those with IQ scores of 98 to 113, and those with
IQ's of 97 or less. This was done for each school and for the school
system as a whole.




For the school system as a whole, the gains for each of the IQ
levels (L, M and H) are plotted in Figure 4-1 for reading and for arith-
metic. Section (a) of Figure 4-1 shows the results for 3rd to 4th grade
gains on the Metropolitan Achievement Test (Harcourt Brace Jovanovich,
1970). The gains observed for 6th to 7th grade are based on the Edu-
cational Development Series (Scholastic Test Service, 1969; 1971).

From section (a) of Figure 4-1 it can be seen that from grades 3 to
4 the largest gains in both reading and arithmetic were made by the high
IQ group and the smallest gains by the low IQ group. As would be ex-
pected, the high ability students had a higher mean test score on the
pretest than the low ability students. At the time of the second test,
the gap between the two extreme groups of students had widened. In
reading, the gap between the two groups was 1.5 GE units at grade 3
and 2.4 GE units at grade 4. The result is quite consistent with the
expectation that "the rich get richer and the poor get poorer." It is
also consistent with the results that have been reported indicating
that, as measured by standardized tests, the gap in achievement between
high and low SES or between minority and majority groups tends to in-
crease with grade level.

The increasing gap in achievement between different SES or ethnic
groups has been interpreted to imply that the schools are differentially
effective. The counterpart for the illustrative school system is that
the system is more effective with high than low ability students.
However, there are many reasons why such a conclusion may not be jus-
tified. Some of these reasons are discussed below but first the 6th
to 7th grade results need to be considered.

Between grades 6 and 7 the mean gains in grade scores on the reading
test of the Educational Development Series were: .6 for the high IQ group,
[illegible] for the middle group, and 1.3 for the low group (see Figure 4-1).
The pattern is just the reverse of that found for grades 3 to 4. In
arithmetic, the grade 6 to 7 pattern was again opposite that of the grade
3 to 4 pattern with gains of .7, .8, and 1.3 for the high, middle, and
low ability groups, respectively. Consider the naive interpretation of
these data: at grades 3 to 4 the schools might be considered to be
more effective with the more able children but at grades 6 to 7 they
might be considered to be more effective with the less able. Further,
imagine the sort of comparison that might be made among school buildings
or among teachers with a predominance of children from different ability
levels if the school building mean gains were compared.

In the example just given there are many differences between the
data at grades 3 to 4 and those at grades 6 to 7. They are based on
tests from different publishers which have different content specifi-
cations and different norm groups, and they are based on different types
of scales (grade equivalent in one case and grade scores in the other).
They also differ in that the same test form spans grades 3 and 4 but dif-
ferent levels which had to be vertically equated were used at grades 6
and 7. These differences may be more than sufficient to explain the seem-
ingly strange results that are shown in Figure 4-1 (Linn, 1974).



[Figure 4-1. Gains in grade equivalent units]



Gains in Grade Equivalent Scales. The above results, which indicated
that students with high pretest scores at grade 3 tended to gain more than
their counterparts with low pretest scores, may seem contrary to what would
be expected from knowledge of correlations of gain scores with pretests.
As indicated early in this paper, gain scores tend to have a spurious
negative correlation with the pretest score. The negative correlation of
pretest with gain comes about when the pretest standard deviation is
greater than the posttest standard deviation times the correlation
between pre and posttests. This will necessarily be the case whenever the
pretest and posttest have equal standard deviations. A property of the
grade equivalent scale, however, is that the standard deviation of grade
equivalent scores tends to increase with grade level and this increase
in standard deviation is sufficient to result in a positive correlation
between pretest scores and gain scores.
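The condition stated above follows from the covariance of the pretest X with the gain Y - X, which equals r * SD(X) * SD(Y) - SD(X)^2. The following is a minimal sketch of that relationship; the function name and the pre-post correlation of .8 are illustrative assumptions, not values from the report.

```python
import math


def gain_pretest_correlation(sd_pre: float, sd_post: float, r: float) -> float:
    """Correlation between pretest X and gain (Y - X).

    Uses cov(X, Y - X) = r * sd_pre * sd_post - sd_pre**2 and
    var(Y - X) = sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post.
    """
    cov_pre_gain = r * sd_pre * sd_post - sd_pre ** 2
    sd_gain = math.sqrt(sd_pre ** 2 + sd_post ** 2 - 2 * r * sd_pre * sd_post)
    return cov_pre_gain / (sd_pre * sd_gain)


# Hypothetical pre-post correlation of .8 in both cases.
equal_sd = gain_pretest_correlation(sd_pre=1.0, sd_post=1.0, r=0.8)
growing_sd = gain_pretest_correlation(sd_pre=1.0, sd_post=2.0, r=0.8)

print(f"equal SDs:           r(pretest, gain) = {equal_sd:+.2f}")
print(f"posttest SD doubled: r(pretest, gain) = {growing_sd:+.2f}")
```

With equal standard deviations the correlation is negative (about -.32 here); when the posttest standard deviation is double the pretest value it turns positive (about +.45).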

The property of increasing standard deviations for grade equivalent
scores at successive grade levels is illustrated by approximating
these standard deviations at two grades and for two subtests of
three widely used achievement test batteries. The standard deviations
were calculated by assuming a normal distribution of grade equivalent
scores and subtracting the grade equivalent corresponding to the
fiftieth centile from the one corresponding to the eighty-fourth centile.
The test batteries that were utilized are the California Achievement
Tests (CTB/McGraw-Hill, 1970), the Stanford Achievement Tests
(Harcourt Brace Jovanovich, 1973) and the Metropolitan Achievement
Tests (Harcourt Brace Jovanovich, 1970). For the reading subtests of the
three test batteries, the estimated standard deviations for grades two
and six changed from .925 to 2.27, from 1.70 to 2.[?]5, and from 1.0 to
2.4. The grade two and six standard deviations for the arithmetic
subtests of the three batteries changed from .773 to 1.57, from 1.0 to
2.05, and from .7 to 1.4. In general, the estimated standard deviations
for grade six are roughly double those for grade two and the necessary
condition for a positive correlation between pretest and gain is seen
to exist.
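As a rough check on the "roughly double" statement, the sketch below computes the ratio of the grade six to the grade two standard deviation for the pairs quoted above, together with the pre-post correlation that would have to be exceeded for the gain-pretest correlation to be positive. Treating the grade two value as a pretest standard deviation and the grade six value as a posttest standard deviation is an assumption made only for illustration. The second reading pair is omitted because its grade six value is not legible in the source, and the dictionary labels are hypothetical.

```python
# (grade two SD, grade six SD) pairs as quoted in the text above.
reported_sds = {
    "reading, first pair": (0.925, 2.27),
    "reading, third pair": (1.0, 2.4),
    "arithmetic, first pair": (0.773, 1.57),
    "arithmetic, second pair": (1.0, 2.05),
    "arithmetic, third pair": (0.7, 1.4),
}


def sd_ratio(sd_grade2: float, sd_grade6: float) -> float:
    """How many times larger the grade six SD is than the grade two SD."""
    return sd_grade6 / sd_grade2


def minimum_correlation(sd_pre: float, sd_post: float) -> float:
    """Pre-post correlation that must be exceeded for a positive
    gain-pretest correlation (r * sd_post > sd_pre)."""
    return sd_pre / sd_post


ratios = {label: sd_ratio(*pair) for label, pair in reported_sds.items()}
thresholds = {label: minimum_correlation(*pair) for label, pair in reported_sds.items()}

for label in reported_sds:
    print(f"{label}: ratio = {ratios[label]:.2f}, r must exceed {thresholds[label]:.2f}")
```

For every pair listed the ratio is at least 2, so any pre-post correlation above about .5 would satisfy the condition.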



Thus, the naive expectation of a gain of one grade equivalent unit
in a year's time ignores the positive correlation between gain and
pretest that has been observed for the grade equivalent scale. "...normal
or typical growth is often defined as one year (1.0) in grade
equivalent units for every school year of instruction. However, 1.0
year of growth is typical only for students near the middle of the
distribution (Prescott, 1973, p. 55)." As shown by Prescott, by Coleman
and Karweit (1970), and by Wrightstone, Hogan and Abbott (undated),
students who maintain a constant percentile rank over several years would
show average gains that are considerably different from 1.0 when the
constant percentile rank deviates substantially from 50.
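The point can be sketched numerically. The example below assumes, purely for illustration, that grade equivalent scores are normally distributed around the grade placement and that their standard deviation grows linearly from 1.0 at grade 2 to 2.4 at grade 6 (one of the pairs of estimates quoted earlier). The linear growth, the function names, and the choice of grade 3 are assumptions, not part of the report.

```python
from statistics import NormalDist


def sd_by_grade(grade: float) -> float:
    """Assumed SD of grade equivalents: linear from 1.0 (grade 2) to 2.4 (grade 6)."""
    return 1.0 + (grade - 2.0) * (2.4 - 1.0) / 4.0


def ge_at_percentile(grade: float, percentile: float) -> float:
    """Grade equivalent of a student at a fixed percentile rank in a given grade."""
    z = NormalDist().inv_cdf(percentile / 100.0)
    return grade + z * sd_by_grade(grade)


def yearly_gain(percentile: float, grade: float) -> float:
    """Gain in grade equivalent units needed to hold the same percentile rank."""
    return ge_at_percentile(grade + 1.0, percentile) - ge_at_percentile(grade, percentile)


for pct in (20, 50, 80):
    print(f"percentile rank {pct}: gain from grade 3 to 4 = {yearly_gain(pct, 3.0):.2f}")
```

Under these assumptions a student at the 80th percentile must gain about 1.3 units a year to hold rank, a student at the 20th percentile only about 0.7, and only the student at the median gains exactly 1.0.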

In order to investigate the generality of the above tendency, the
grade equivalent score deviations from grade level for hypothetical
students with constant percentile ranks of 20 and of 80 were plotted for
several different tests for grades [?] through 6. These results for the
reading and arithmetic tests of three widely used achievement test
batteries are shown in Figure 4-2. The test batteries for which data are
plotted in Figure 4-2 are the Metropolitan Achievement Tests (Harcourt
Brace Jovanovich, 1970), the California Achievement Tests (CTB/McGraw-Hill,
1970) and the Stanford Achievement Tests (Harcourt Brace Jovanovich, 1973).







The graphs shown in Figure 4-2 provide the basis for several
generalizations: (1) the average growth required to maintain a constant
percentile rank of 80 is considerably more than 1.0 grade equivalent unit
per year, (2) the average growth required to maintain a constant
percentile rank of 20 is substantially less than 1.0 grade equivalent
unit per year, (3) the average gain in grade equivalent units required
to maintain a constant percentile rank of 80 is less for arithmetic
tests than for reading tests, and (4) the average gain in grade
equivalent units required to maintain a constant percentile rank varies
substantially from one test publisher to another.

Based on the results shown in Figure 4-2, the 3rd to 4th grade gains
for the illustrative school in Figure 4-1 are quite consistent with what
would be expected. The results for grades 3 to 4 certainly are dependent
on particular characteristics of the grade equivalent scale that are not
really fundamental to notions of student performance. Thus, the result
that the more able students tend to gain the most may simply be an
artifact of the grade equivalent scale and the naive interpretation that
the schools are relatively more effective for high ability than for low
ability students is suspect.

A possible conclusion based on the difficulties with the grade
equivalent outlined above is that percentile ranks might provide a
better scale for comparing growth of groups of students that start at
different levels initially. Percentile ranks, however, suffer from
other limitations. They tend to spread raw scores out in the middle of
the distribution and squeeze them together at the extremes. A
distribution of percentile ranks is necessarily rectangular and the raw
score distance between the 50th and 55th percentile is much less than
the raw score distance between the 90th and 95th percentile. Due to
this limitation of percentile ranks, Coleman and Karweit (1970) conclude
that they are not a useful type of score for measuring the amount of
change but they may be useful for measuring the direction of change.
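The unequal raw score distances can be illustrated with a small sketch. It assumes, for illustration only, that raw scores are normally distributed and expresses distances in standard deviation units; the function name is hypothetical.

```python
from statistics import NormalDist


def raw_score_distance(lower_pct: float, upper_pct: float,
                       mean: float = 0.0, sd: float = 1.0) -> float:
    """Raw score distance between two percentile ranks under a normal distribution."""
    dist = NormalDist(mean, sd)
    return dist.inv_cdf(upper_pct / 100.0) - dist.inv_cdf(lower_pct / 100.0)


middle = raw_score_distance(50, 55)   # five percentile points near the median
upper = raw_score_distance(90, 95)    # five percentile points in the upper tail

print(f"50th to 55th percentile: {middle:.3f} SD units")
print(f"90th to 95th percentile: {upper:.3f} SD units")
```

Under the normality assumption the same five-point change in percentile rank corresponds to about .13 standard deviations near the median and about .36 in the upper tail, nearly three times as much.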

According to the test manual, the grade scores that were used to
summarize the test results for the school system at grades 6 and 7
(Figure 4-1) were "... developed in an attempt to utilize the strong
points inherent in percentile rank and grade equivalent norms while
minimizing the inherent limitations of such norms scores" (Scholastic
Testing Service, 1971, p. 12). Grade scores are obtained from standard
scores at each grade level with the mean set equal to the grade placement
level and the standard deviation set equal to 1.0. According to the
publisher, "Score changes [in grade score units] of more than one unit
indicate relatively rapid growth as compared with other pupils; score
changes less than one unit indicate relatively slow growth as compared to
other students" (Scholastic Testing Service, 1971, p. 13).

A review of grade score scale properties (Linn, 1974) revealed
several undesirable characteristics of this type of scale for purposes
of measuring change. The most obvious disadvantage of this type of scale
is that constant raw scores over several points in time will result in
increasing grade scores and "apparent growth." Furthermore, the
magnitude of the apparent change varies from one raw score level to another.
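A small sketch may make the "apparent growth" concrete. It follows the definition given above (a standard score within grade, rescaled to a mean equal to the grade placement and a standard deviation of 1.0) and assumes the same test form is given at both grades. The raw score norms in `NORMS` are invented for illustration and are not taken from any publisher's tables.

```python
def grade_score(raw: float, grade_placement: float,
                norm_mean: float, norm_sd: float) -> float:
    """Grade score: within-grade standard score shifted to the grade placement."""
    return grade_placement + (raw - norm_mean) / norm_sd


# Hypothetical raw score norms: grade placement -> (mean, standard deviation).
NORMS = {6.0: (30.0, 8.0), 7.0: (34.0, 10.0)}


def apparent_growth(raw: float) -> float:
    """Change in grade score from grade 6 to grade 7 for an unchanged raw score."""
    at_grade6 = grade_score(raw, 6.0, *NORMS[6.0])
    at_grade7 = grade_score(raw, 7.0, *NORMS[7.0])
    return at_grade7 - at_grade6


low_scorer = apparent_growth(22.0)    # grade score 5.0 at grade 6, 5.8 at grade 7
high_scorer = apparent_growth(38.0)   # grade score 7.0 at grade 6, 7.4 at grade 7

print(f"raw score 22: apparent growth = {low_scorer:.1f}")
print(f"raw score 38: apparent growth = {high_scorer:.1f}")
```

With these made-up norms neither student's raw score changes, yet both show positive "growth," and the low scorer appears to grow twice as much as the high scorer.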






As far as the results in section b of Figure 4-1 are concerned,
there are two factors which may readily account for the relatively large
gains for initially low scoring students and relatively small gains for
initially high scoring students. First, by setting the standard
deviations at different grade levels equal, a negative correlation between
pretest and gain is ensured. The second factor that is relevant to the
particular situation of the grade 6 to 7 results is that different levels
of the test were used at grades 6 and 7. As shown by Linn (1974),
difficulties in vertically equating tests and the large increase in the
scaled score equivalents of minimum and chance level raw scores when
the level of the test is changed could easily account for the apparently
larger gains of initially low scoring students than their initially
high scoring counterparts. Again, the results of Figure 4-1 do not
provide a basis for generalizations about the relative effectiveness
of the school system with different groups of students.

One difficulty with vertically equated tests, the large increase
in scaled score equivalents of minimum and chance level raw scores when
the level of the test is changed, is not limited to grade scores. It
is also a potential problem when grade equivalent scores are used with
vertically equated tests. Reported in Figure 4-3 for grades 2 through
6 are the grade equivalent scores associated with chance level
performance on the reading and arithmetic subtests of the three previously
used achievement test batteries. As seen from Figure 4-3, the increase
in grade equivalent scores from one level to the next for hypothetical
students who respond at random varies considerably across publishers'
tests and across the two subtests. However, even the minimum
increase of [illegible] grade equivalent units would result in apparent
growth for students who respond at the chance level.

The Wrong Norms. A number of other difficulties with using norms
as standards for evaluating student progress might be mentioned but the
illustration of one other problem should suffice. Longitudinal data
are often thought to be preferable to cross-sectional data because of
the possibility of cohort differences and because if you are interested
in the effect of a school it seems reasonable to look at students who
have been in the school for a given period of time. However, the
available normative data on standardized achievement tests are
cross-sectional. Longitudinal samples often suffer from considerable
attrition. Consequently the differences between data for a longitudinal
sample and the test norms are apt to be differentially affected by
selection factors at different levels. This can be illustrated by data
from a national study of academic growth conducted at Educational Testing
Service under the direction of Tom Hilton. The data for the following
illustration were taken from the extensive set of tables reported by
Hilton and Beaton (1971) and have previously been discussed by Linn (1974).

The longitudinal sample of approximately 3600 students was divided
into two groups according to high school curriculum: academic and
nonacademic. The scaled score means on one of the tests and the
corresponding percentile ranks of the means are plotted in Figure 4-4 for
these two groups. The test was the Quantitative Test of the School and
College Ability Tests, SCAT (Educational Testing Service, 1957). At the
fifth grade the academic group is well above the median of the norm group
and the nonacademic group is slightly above the median of the norm group.

[Figure 4-3. Grade equivalent scores associated with "chance level" performance]

[Figure 4-4. Scaled score means and percentile ranks of the means for the academic and nonacademic groups]




The percentile ranks of the means for both groups drop slightly from
grade 5 to grade 7 and more sharply from grade 7 to grade 9. Between
grade 9 and grade 11 the academic group maintains about the same
percentile rank while the nonacademic group shows another drop.

The initial impressions from Figure 4-4 are that the nonacademic
students are falling further and further behind the academic students
and both groups of students are losing ground relative to the national
norm. Both of these results, however, may be the consequence of a
common problem encountered in longitudinal studies, namely attrition.
The initial ETS growth study consisted of about 9,000 5th grade students.
Only about 40 percent of these students had test score data at grades
5, 7, 9, and 11 and the nonrandom nature of the attrition is apt to
have different implications at 5th grade than at 11th grade. For
example, students who drop out of school between the 5th and 11th grades
are available for the norms group at grade 5, but not at grade 11.
For the longitudinal sample they are excluded at both points in time
(Linn, 1974).
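The selection effect can be sketched with a deliberately simplified example. It assumes that every student keeps exactly the same relative standing from grade 5 to grade 11, that scores are normally distributed, and that the lowest-scoring 20 percent drop out before grade 11. All of these figures are hypothetical and chosen only to show the direction of the bias.

```python
from statistics import NormalDist, mean

N = 1000
# Evenly spaced normal quantiles stand in for the grade 5 norm group (ascending order).
all_students = [NormalDist().inv_cdf((i + 0.5) / N) for i in range(N)]

# Hypothetical attrition: the lowest 20 percent leave school before grade 11.
persisters = all_students[N // 5:]


def percentile_rank(value: float, norm_group: list) -> float:
    """Percent of the norm group scoring below the given value."""
    below = sum(score < value for score in norm_group)
    return 100.0 * below / len(norm_group)


longitudinal_mean = mean(persisters)

# Grade 5 norms include eventual dropouts; grade 11 norms do not.
rank_grade5 = percentile_rank(longitudinal_mean, all_students)
rank_grade11 = percentile_rank(longitudinal_mean, persisters)

print(f"grade 5 percentile rank of longitudinal mean:  {rank_grade5:.1f}")
print(f"grade 11 percentile rank of longitudinal mean: {rank_grade11:.1f}")
```

Although no student changes relative standing in this sketch, the percentile rank of the longitudinal sample's mean falls between the two grades, simply because the cross-sectional norm groups differ in composition.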

Problems due to using cross-sectional norms can arise even where
the longitudinal data cover only two points in time within a single
school year. Data from two points in a single year are unlikely to
result in a major attrition problem such as was encountered for the data
in Figure 4-4. Nonetheless, using such data to interpolate to
other points in the year may result in misleading [illegible].
For example, Beck (1975) has recently shown that norms based only on
fall testing tend to underestimate the actual spring performance of a
longitudinal sample that is tested in the fall and again in the spring.

Regression Approaches to Accountability

One of the better known approaches to [...] system is the one proposed
by Dyer (1970; Dyer, Linn & Patton, 1969). His approach, which was first
described before the term "accountability" came into popular use (Dyer,
1966), is based on what he calls the "pupil-change model of a school."
Actually student change per se is never assessed in Dyer's approach;
instead, regression equations are used to compute residual mean
performance for a school. These residuals form the basis for obtaining
"school effectiveness indices."

As initially conceived, the Dyer approach would distinguish four
major categories of variables called input, surrounding conditions,
educational process, and output. The input and output variables refer
to student characteristics measured before and after a given period of
schooling. While these [illegible] of posttest scores. [illegible]
influence [illegible] achievement.



With the four categories of variables in hand, regression analyses
involving the input, output and hard to change surrounding conditions
would be used to obtain "school effectiveness indices" for each output
measure. Specifically, using school means, a given output or posttest
score would be regressed on the input measures and hard to change
surrounding condition variables. Schools with observed mean scores
on the posttest that were above the value predicted for that school would
receive relatively high school effectiveness indices. Schools with
posttest means lower than predicted would receive relatively low indices.
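The residual computation described above can be sketched as follows. This is a simplified illustration rather than Dyer's procedure: it uses a single input measure (a school mean pretest) instead of several input and surrounding condition variables, and the school means are invented.

```python
from statistics import mean


def school_effectiveness_indices(input_means: list, output_means: list) -> list:
    """Residuals from a least squares regression of output means on input means."""
    mx, my = mean(input_means), mean(output_means)
    sxx = sum((x - mx) ** 2 for x in input_means)
    sxy = sum((x - mx) * (y - my) for x, y in zip(input_means, output_means))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Observed minus predicted posttest mean for each school.
    return [y - (intercept + slope * x) for x, y in zip(input_means, output_means)]


# Hypothetical school means for five schools.
pretest_means = [40.0, 45.0, 50.0, 55.0, 60.0]
posttest_means = [48.0, 50.0, 58.0, 59.0, 67.0]

indices = school_effectiveness_indices(pretest_means, posttest_means)
for school, index in enumerate(indices, start=1):
    print(f"school {school}: index = {index:+.1f}")
```

Schools with positive residuals scored above prediction and would receive relatively high indices; schools with negative residuals would receive relatively low ones. By construction the residuals sum to zero across schools.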


Only after the school effectiveness indices are obtained would the
easy to change surrounding conditions and the school process variables
come into play. The focus would be on outliers, i.e., those schools
that have posttest means much better (or worse) than predicted from the
pretests and hard to change surrounding conditions. The extreme
outliers, which in another context would be called "overachievers and
underachievers" (Thorndike, 1963), would then be compared in terms of the
easy to change surrounding condition variables and the educational
process variables.

Dyer was well aware that his proposed approach gives no guarantee
of finding the characteristics of schools that produce the maximum
achievement. Rather the approach was conceived of as a kind of search
strategy for identifying variables that might be instrumental to better
student performance. The actual efficacy of these variables could then
be investigated in experimental studies.






There are a number of questions that might be raised concerning
Dyer's approach. As indicated in the first section of this paper, [...]
residuals still may be questionable. Dyer, Linn and Patton (1969)
provided results that are relevant to one type of reliability of the
school residuals. School systems were subdivided into two random
halves and residuals computed for each half sample. The correlations of
the half sample residual scores ranged from .73 to .88 for six different
posttests. While these results suggest reasonable stability, less
encouraging results were obtained by Forsyth (1973) when he investigated
another type of reliability.

Forsyth (1973) obtained school residuals according to the Dyer
model for two successive time intervals (posttests obtained in 1968 and
in 1969). The correlations between residuals obtained for schools at the
two different points in time ranged from -.11 to .50 for 10 posttests with
a median correlation of only .28. Thus, it would appear that the
residuals may be relatively stable from one subsample of students to
another within a single year but relatively unstable from one year to the
next. This instability over time is seen as a major limitation on the
potential usefulness of this approach.

Recently, Marco (1974) compared four different methods of obtaining
school effectiveness indices in addition to the one originally suggested
by Dyer. He found that all five methods yielded indices that were
highly intercorrelated and relatively stable from one half sample to
another. His study does not address the issue of stability over time
or the practical utility of the indices, however.

CONCLUSION

This paper has ranged over a fairly broad spectrum of topics that
share as a common thread concern about measuring change from pre to
posttesting periods. Problems in measuring change abound and the
virtues in doing so are hard to find. Major disadvantages in the use of
change scores are that they tend to conceal conceptual difficulties and
they give misleading results. The former tendency is apparent when change
scores are used to compare preexisting groups, which tends to conceal
the arbitrariness of this particular form of adjustment. The latter
tendency is apparent where various standardized test scales such as
grade equivalents or percentile ranks are used to assess gains of
different groups of students.

To conclude with Cronbach and Furby (1970) "...that investigators
who ask questions regarding gain scores would ordinarily be better
advised to frame their questions in other ways (p. 80)" may seem very
discouraging. If so, however, it is probably because more is expected
from gain scores than they can reasonably be expected to provide.
They cannot, for instance, be expected to make up for the lack of random
assignment, nor can other adjustment techniques. For most purposes,
a pretest score is best treated on the same footing as other measures
that are obtained at the time of the pretest. Where appropriate,
regression analyses that treat the pretest no differently than other
independent variables (or predictors) and the posttest as the dependent
variable avoid many of the difficulties that are introduced by gain
scores.



REFERENCES

American Psychological Association. Standards for Educational and
    Psychological Tests. Washington, D. C.: 1974.

Angoff, W. H. Scales, norms and equivalent scores. In R. L. Thorndike
    (Ed.) Educational Measurement, 2nd Edition. Washington, D. C.:
    American Council on Education, 1971.

Beck, M. D. Development of empirical "growth expectancies" for the
    Metropolitan Achievement Tests. Presented at the meeting of the
    National Council on Measurement in Education, Washington, D. C.,
    1975.

Bereiter, C. Some persisting dilemmas in the measurement of change.
    In C. W. Harris (Ed.) Problems in Measuring Change. Madison:
    University of Wisconsin Press, 1963, pp. 3-20.

Coleman, J. S. & Karweit, N. L. Measures of School Performance. Santa
    Monica, California: Rand, R-488-RC, July 1970.

Cronbach, L. J. & Furby, L. How we should measure "change" - or should
    we? Psychological Bulletin, 1970, 74, 68-80.

CTB/McGraw-Hill. California Achievement Tests (1970 ed.). Monterey,
    California: CTB/McGraw-Hill, 1970.

DuBois, P. H. Multivariate Correlational Analysis. New York: Harper,
    1957.

Dyer, H. S. The Pennsylvania Plan. Science Foundation, 1966, 50,
    242-248.

Dyer, H. S. Toward objective criteria of professional accountability
    in the schools of New York City. Phi Delta Kappan, 1970, 52,
    206-211.

Dyer, H. S., Linn, R. L. & Patton, M. J. A comparison of four methods
    of obtaining discrepancy measures based on observed and predicted
    school system means on achievement tests. American Educational
    Research Journal, 1969, 6, 591-605.

Educational Testing Service. School and College Ability Test. Princeton,
    New Jersey: Educational Testing Service, 1957.

Elashoff, J. D. Analysis of covariance: a delicate instrument. American
    Educational Research Journal, 1969, 6, 383-402.

Feldt, L. S. A comparison of the precision of three experimental designs
    employing a concomitant variable. Psychometrika, 1958, 23, 335-353.

Forsyth, R. A. Some empirical results related to the stability of
    performance indicators in Dyer's student change model of an
    educational system. Journal of Educational Measurement, 1973, 10.



Glass, G. V. The many faces of "educational accountability". Phi
    Delta Kappan, 1972, 53, 636-639.

Goulet, L. R. Longitudinal and time-lag designs in educational research:
    an alternate sampling model. Review of Educational Research, in press.

Harcourt Brace Jovanovich. Metropolitan Achievement Tests (1970 ed.).
    New York: Harcourt Brace Jovanovich, 1970.

Harcourt Brace Jovanovich. Stanford Achievement Tests (1973 ed.). New
    York: Harcourt Brace Jovanovich, 1973.

Hilton, T. L. & Beaton, A. E. Stability and instability in academic
    growth - a compilation of longitudinal data. Final Report,
    August, 1971, Educational Testing Service, Grant No. OEG-2-
    7000013(509), U. S. Office of Education.

Hilton, T. L. & Patrick, C. Cross-sectional versus longitudinal data:
    an empirical comparison of mean differences in academic growth.
    Journal of Educational Measurement, 1970, 7, 15-24.

Lindquist, E. F. & Hieronymus, A. N. Manual for administrators,
    supervisors and counselors, Iowa Tests of Basic Skills. Boston,
    Massachusetts: Houghton Mifflin Company, 1964.

Linn, R. L. The use of standardized test scales to measure growth.
    Conference on Policy Research: Methods and Implications. University
    of Wisconsin, Madison, Wisconsin, May 1974.

Linn, R. L. & Werts, C. E. Errors of inference due to errors of
    measurement. Educational and Psychological Measurement, 1973, 33,
    531-543.

Lord, F. M. The measurement of growth. Educational and Psychological
    Measurement, 1956, 16, 421-437. See also Errata, ibid., 1957,
    17, 452.

Lord, F. M. Further problems in the measurement of change. Educational
    and Psychological Measurement, 1958, 18, 437-454.

Lord, F. M. Large sample covariance analysis when the control variable
    is fallible. Journal of the American Statistical Association, 1960,
    55, 309-321.

Lord, F. M. Elementary models for measuring change. In C. W. Harris
    (Ed.) Problems in Measuring Change. Madison: University of Wisconsin
    Press, 1963, 21-38.

Lord, F. M. A paradox in the interpretation of group comparisons.
    Psychological Bulletin, 1967, 68, 304-305.

Lord, F. M. Statistical adjustments when comparing pre-existing groups.
    Psychological Bulletin, 1969, 72, 336-337.



Manning, W. H. & DuBois, P. H. Correlational methods in research on
    human subjects. Perceptual and Motor Skills, 1962, 15, 287-321.

Marco, G. L. A comparison of selected school effectiveness measures
    based on longitudinal data. Journal of Educational Measurement,
    1974, 11, 225-234.

Marks, E. & Martin, C. G. Further comments relating to the measurement of
    change. American Educational Research Journal, 1973, 10, 179-191.

McNemar, Q. On growth measurement. Educational and Psychological
    Measurement, 1958, 18, 47-55.

Meehl, P. E. Nuisance variables and the ex post facto design. In
    M. Radner & S. Winokur (Eds.) Minnesota Studies in Philosophy
    of Science, Volume 4. Minneapolis: University of Minnesota Press,
    1970.

O'Connor, E. F., Jr. Extending classical test theory to the measurement
    of change. Review of Educational Research, 1972, 42, 73-97.

Porter, A. C. The effects of using fallible variables in the analysis of
    covariance. Doctoral dissertation, University of Wisconsin.
    Ann Arbor, Michigan: University Microfilms, 1967, No. 67-12,147.

Prescott, G. A. Manual for Interpreting: Metropolitan Achievement Test.
    New York: Harcourt Brace Jovanovich, Inc., 1973.

Rosenshine, B. & McGaw, B. Issues in assessing teacher accountability in
    public education. Phi Delta Kappan, 1972, 53, 640-643.

Rubin, D. Estimating Causal Effects of Treatments in Experimental and
    Observational Studies. Princeton, N. J.: Educational Testing Service,
    Research Bulletin, 72-39, 1972.

Scholastic Testing Service. Educational Development Series Technical
    Report, Elementary Level - Spring 1971. Bensenville, Illinois:
    Scholastic Testing Service, 1971.

Stake, R. E. Testing hazards in performance contracting. Phi Delta
    Kappan, 1971, 52, 583-589.

Tatsuoka, K. K. Vector-Geometric and Hilbert-Space Reformulations of
    Classical Test Theory. [...] University of Illinois, 1975.

Thorndike, R. L. The concepts of over- and underachievement. New York:
    Columbia University, Teachers College, Bureau of Publications,
    1963.

Thorndike, R. L. Intellectual status and intellectual growth. Journal
    of Educational Psychology, 1966, 57, 121-127.



Wargo, M. J., et al. ESEA Title I: A reanalysis and synthesis of
    evaluation data from fiscal year 1965 through 1970. Palo Alto,
    California: American Institutes for Research, 1972.

Werts, C. E., Jöreskog, K. G. & Linn, R. L. A multitrait-multimethod
    model for studying growth. Educational and Psychological Measurement,
    1972, 32, 655-678.

Werts, C. E. & Linn, R. L. A general linear model for studying growth.
    Psychological Bulletin, 1970, 73, 17-22.

Werts, C. E. & Linn, R. L. Analyzing school effects: ANCOVA with a
    fallible covariate. Educational and Psychological Measurement,
    1971, 31, 95-104.

Wrightstone, J. W., Hogan, T. P., & Abbott, M. M. Accountability in
    education and associated measurement problems. Test Service
    Notebook 33. New York: Harcourt Brace Jovanovich, Inc., (undated).



CHAPTER 5

VERTICALLY EQUATED TEST FORMS

In large scale testing programs it is frequently necessary and
desirable to have several forms of a test. Multiple forms are essential
for admissions tests such as the College Board's Scholastic Aptitude
Test or the American College Testing Program's tests. The purpose of
the equating is to convert the raw scores obtained from two forms of
the test "...so that scores derived from the two forms after conversion
will be directly equivalent (Angoff, 1971, p. 562)." In the case of
admissions tests, equating is essential because comparisons are made
between persons who take different forms of the test and without the
equating persons who happened to take one form of the test that was
inadvertently more difficult than another form would be at a
disadvantage relative to their peers who happened to take the easier form.




Equating test forms that are designed to measure the same thing
for the same population is sometimes referred to as horizontal equating
(see, for example, Educational Testing Service, 1957, pp. 7-9). Vertical
equating, on the other hand, refers to the process of converting scores of
forms of a test designed for populations at different educational levels
to a single scale. In horizontal equating, different forms of the test
would normally be designed to have comparable item content and similar
distributions of item statistics. The equating adjusts for unintended
differences in difficulty of the tests or differences in distributions of
the examinees. In contrast, forms to be vertically equated differ
intentionally in the difficulty of the items for a single population of
examinees and in their content specifications as well. For example, an
appropriate arithmetic item might be 4 + 3 = ? at grade 1, 155 - 62 = ?
at grade 3, [...] x 4 = ? at grade 5, and 5.45 ÷ .25 = ? at grade 7. To
be sure, such items are all in the general domain of arithmetic but
they are not necessarily indicators of a single common trait. In
other achievement areas even greater diversity of item type, difficulty,
and content frequently can be found as changes in the level of a test
occur while a common name and supposedly common scale is maintained.
It is no surprise that the problem of vertical equating is substantially
more difficult than that of horizontal equating.

In this section, the two most commonly used equating procedures will
be briefly reviewed. The adequacy of these methods for the vertical
equating problem will then be considered. Finally consideration will be
given to alternative equating methods with special emphasis on the use
of the Rasch model.

LINEAR AND EQUIPERCENTILE METHODS

"Two scores, one on form X and the other on form Y (where X and
Y measure the same function with the same degree of reliability),
may be considered equivalent if their corresponding percentile ranks in
any given group are equal" (Angoff, 1971, p. 563). This commonly accepted
definition suggests immediately the equipercentile method of equating.
All that is required for the equipercentile method of equating is the
cumulative frequency distribution for each test. The $k$th score level
on form X, $X_k$, is converted to the same scaled score as the $l$th score
level of test Y, $Y_l$, if the percentiles of $X_k$ and $Y_l$ are the same. In
practice, smoothed frequency distributions are typically used and raw
scores on the tests corresponding to some predetermined set of percentile
ranks are found by interpolation. Also, there are a variety
of different study designs that might be used for the equating. For
example, both tests may be administered to a single group, the tests
may be administered to different random samples from the same population,
or the tests along with a common anchor test may be administered to
samples from different populations. For a detailed description of these
and other possible designs see Angoff (1971). Ignoring these procedural
details, however, the equipercentile method is quite straightforward.
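The equipercentile rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the report: raw scores are the integers 0, 1, 2, ..., percentile ranks are taken at the midpoint of each score, no smoothing is applied, and the function names `percentile_ranks` and `equipercentile_equate` are hypothetical.

```python
from bisect import bisect_left


def percentile_ranks(freqs):
    """Midpoint percentile rank of each raw score 0..len(freqs)-1.

    freqs[j] is the number of examinees with raw score j.
    """
    total = sum(freqs)
    ranks, below = [], 0
    for f in freqs:
        ranks.append(100.0 * (below + 0.5 * f) / total)
        below += f
    return ranks


def equipercentile_equate(x, freqs_x, freqs_y):
    """Form-Y score whose percentile rank equals that of score x on form X.

    Linear interpolation is used between adjacent form-Y scores
    (an assumption; operational work typically smooths first).
    """
    pr_x = percentile_ranks(freqs_x)[x]
    pr_y = percentile_ranks(freqs_y)
    if pr_x <= pr_y[0]:
        return 0.0
    if pr_x >= pr_y[-1]:
        return float(len(pr_y) - 1)
    j = bisect_left(pr_y, pr_x)          # first Y score with rank >= pr_x
    lo, hi = pr_y[j - 1], pr_y[j]
    return (j - 1) + (pr_x - lo) / (hi - lo)
```

With identical frequency distributions on the two forms the conversion is the identity; if form Y is uniformly one point easier, a form-X score of 2 converts to a form-Y score of 3.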

Linear equating would assign the same scaled score to scores $X$ and
$Y$ if they correspond to the same standard score, that is, if

$$\frac{X - \mu_X}{\sigma_X} = \frac{Y - \mu_Y}{\sigma_Y},$$

where $\mu_X$, $\sigma_X$ and $\mu_Y$, $\sigma_Y$ are the means and standard
deviations of $X$ and $Y$ respectively. As noted by Angoff (1971), the
equipercentile and linear equating methods coincide if the two marginal
distributions differ only in their first and second moments. More
generally, the two methods will yield similar results when the raw score
frequency distributions are similar.
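As a small illustration of the linear rule, the sketch below converts a form-X score to the form-Y score with the same standard score. It assumes the means and standard deviations are estimated directly from two lists of observed scores; the function name `linear_equate` is hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Form-Y score with the same standard score as x on form X.

    Solves (x - mean_x) / sd_x = (y - mean_y) / sd_y for y.
    """
    mean_x, sd_x = mean(scores_x), pstdev(scores_x)
    mean_y, sd_y = mean(scores_y), pstdev(scores_y)
    return mean_y + sd_y * (x - mean_x) / sd_x
```

Because only the first two moments enter the conversion, any difference in the shape of the two distributions is ignored, which is the limitation discussed in the text.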

For purposes of vertical equating there are two important aspects of
the above paragraphs that need to be considered. (1) Linear equating
might be expected to be less adequate than equipercentile equating for
the vertical situation because there is less reason to expect X and Y
to have distributions of about the same shape. (2) A key aspect in the
definition of equivalent scores given above is the requirement that the
percentile ranks be equal "... in any given group ..." If this requirement
is not met then the conversion will not be unique. More will be said
about this second point below, but first a few comments are offered
regarding the likely utility of the linear method in vertical equating.

THE ANCHOR TEST STUDY

Undoubtedly the largest equating study ever conducted was the
Anchor Test Study (Bianchini and Loret, 1974). (For a more complete
review of the Anchor Test Study see Appendix A.) This study and its
supplement equated eight widely used standardized reading tests at
grades 4, 5, and 6. Although the equating was done separately within
each grade, and thus the equating might naturally be viewed as
horizontal, the results are in fact quite relevant to the problem of
vertical equating. The tests being equated differed substantially
in difficulty level as well as in content specifications. Furthermore,
there were a variety of patterns of common versus different forms
used at grades 4, 5, and 6 which make it possible to compare equated
scores at one grade level with those at another.

The various pairs of tests involved in the Anchor Test Study
were equated by both the equipercentile and linear methods. These
methods were compared in terms of the estimated errors of equating
which were obtained by the use of McCarthy's balanced half-sample
replication method (1966). The equating design consisted of a set of
eight balanced half-samples. These half-sample replications were used
to compute the root-mean squared deviation of equivalent scores on
the anchor test for each half-sample replication about the anchor
test equivalent scores for the full sample. Based on the estimated
errors the equipercentile method was judged to be clearly superior
to the linear method. Furthermore, the degree of superiority was
greatest for those tests which differed most from the anchor test in
their level of difficulty. Based on these results and logical
considerations about the likelihood that distributions of forms to be
vertically equated will differ in moments higher than the second, the
equipercentile method seems preferable to the linear method in the
vertical situation.
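The error index described above, a root-mean squared deviation of half-sample equivalents about the full-sample equivalents, can be sketched as follows. This is an assumption-laden illustration: it takes the equated scores for each replication as given, does not construct McCarthy's balanced half-samples, and the function name `rms_equating_error` is hypothetical.

```python
from math import sqrt


def rms_equating_error(full_sample_equivalents, half_sample_equivalents):
    """Root-mean squared deviation of half-sample equated scores
    about the full-sample equated scores, one value per score level.

    full_sample_equivalents: list of equated scores from the full sample.
    half_sample_equivalents: list of lists, one per half-sample
        replication, aligned with full_sample_equivalents.
    """
    n_reps = len(half_sample_equivalents)
    errors = []
    for k, full in enumerate(full_sample_equivalents):
        sum_sq = sum((rep[k] - full) ** 2 for rep in half_sample_equivalents)
        errors.append(sqrt(sum_sq / n_reps))
    return errors
```

Computing this index once for equipercentile equivalents and once for linear equivalents gives the kind of side-by-side comparison the paragraph describes.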

The Anchor Test Study also provides another form of evidence that
is relevant for the problem of vertical equating. Two tests involved
in the study changed levels between grades 4 and 5, three tests changed
levels between grades 5 and 6, two tests involved a single level over
all three grades, and one test changed levels at each grade. These
different patterns of levels make possible a variety of comparisons
of the equatings of two levels of one test to a single level of another
test. For example, the same level of the California Achievement Tests,
CAT, (CTB/McGraw-Hill, 1970) was used at grades 4 and 5 but different
levels of the Metropolitan Achievement Tests, MAT, (Harcourt, Brace,
Jovanovich, 1970) were used at those grades. Using the CAT
equivalencies of the MAT, it is possible to convert the MAT Elementary Level
Reading scores to equivalent Intermediate Level Reading scores. For
purposes of illustration, a few scores of the CAT at grade 4 were
selected and the equivalent Elementary Level MAT scores were noted.
The same CAT scores were then used at grade 5 to find the equivalent
Intermediate Level MAT raw scores. These scores are shown in Table 5-1.
The publisher's norms were used to convert the equated MAT Elementary
and Intermediate raw scores to grade equivalent scores. The
resulting grade equivalent scores are also reported in Table 5-1.
Finally, the grade equivalent score at grade 4 was subtracted from the
corresponding score at grade 5 and the difference was recorded in the
last column of Table 5-1.




If the two columns of grade equivalent scores in Table 5-1 are
compared, some non-trivial differences in the grade equivalents can be
observed. The largest of the differences in corresponding grade equivalents
shown in Table 5-1 occurs for MAT raw scores that are equivalent to a
CAT raw score of 60. At this level, the grade equivalent scores are 6.6
at grade 4 and 7.4 at grade 5, for a difference of 0.8 grade equivalent
units, which would presumably be interpreted as almost a "year's gain."
Except at the extremely high end of the distribution, the grade
equivalents tend to be larger at grade 5 than at grade 4.
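The last column of Table 5-1 is simple arithmetic on the two grade equivalent columns, as the short sketch below shows. The grade equivalents are transcribed from Table 5-1; the variable and function names are illustrative only.

```python
# Grade equivalents (GE) from Table 5-1, keyed by CAT Level 3 raw score.
ge_grade4 = {80: 9.9, 70: 8.4, 60: 6.6, 50: 5.2,
             40: 3.7, 30: 3.2, 20: 2.3, 10: 1.3}   # MAT Elementary Level
ge_grade5 = {80: 9.8, 70: 8.4, 60: 7.4, 50: 5.5,
             40: 4.4, 30: 3.5, 20: 2.6, 10: 1.4}   # MAT Intermediate Level


def ge_differences(lower_level, higher_level):
    """Grade 5 GE minus grade 4 GE at each CAT raw score."""
    return {score: round(higher_level[score] - lower_level[score], 1)
            for score in lower_level}


differences = ge_differences(ge_grade4, ge_grade5)
largest = max(differences, key=differences.get)   # CAT score with biggest gap
```

The largest gap falls at a CAT raw score of 60, matching the 0.8 grade equivalent units cited in the text.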


A number of other test combinations could be used to produce tables
such as Table 5-1. For example, the grade 4 and grade 5 MAT scores could
be equated through their links to the Comprehensive Tests of Basic Skills,
CTBS, (CTB/McGraw-Hill, 1968) rather than through the CAT. This was
done and the results are reported in Table 5-2. As can be seen in Table
5-2, the grade equivalents at grade 5 again tend to be higher than the
corresponding grade equivalents at grade 4.

The results in Tables 5-1 and 5-2 suggest that changes in grade
equivalent units might differ substantially depending on whether a
single level of a test or two vertically equated levels of a test are
being used in, say, a longitudinal research study. In particular,
larger gains would be expected using the Elementary level of the MAT
at grade 4 and the Intermediate level of the MAT at grade 5 than would
be expected if either level 2 of the CTBS or level 3 of the CAT were
used at the two grades.

In addition to the grade equivalent scores, vertically equated
"standard scores" were also compared. The standard scores reported
by the publisher of the MAT are scaled to range from grade 1
to grade 9. At grade 4, the mean scaled score is about 66 and the
associated standard deviation is about 14. By grade 9, the mean and
standard deviation are approximately 96 and 17 respectively.

The grade 4 and grade 5 standard scores of the MAT were compared
by converting equivalent raw scores on the Elementary and Intermediate
Levels of the MAT to standard scores. When the CAT was used to define
equivalent raw scores on the MAT, the results in Table 5-3 were obtained.
The results in Table 5-4 were obtained by using the CTBS to define
equivalent MAT raw scores for the two levels of the MAT. For all but
relatively high scores, the Intermediate Level MAT standard scores are
somewhat higher than the "equated" Elementary Level standard scores. This
is true whether the equating is accomplished via the CAT (Table 5-3)
or via the CTBS (Table 5-4). Furthermore, the magnitude of the difference
in standard scores is relatively large in some parts of the score
distribution.

It might be noted that the largest differences in standard scores
reported in Tables 5-3 and 5-4 occur at the extremes where relatively
few observations are expected. Even in the central part of the score
range, however, the differences are as large as a third of a within
grade standard deviation. A difference as big as a third of a standard
deviation is apt to loom large relative to the magnitude of "effects"
that are being evaluated. Thus, whether grade equivalent scores or other






TABLE 5-1

Total Reading Equivalent Scores on the MAT Elementary
and Intermediate Levels (Grade Equivalents via CAT)

                    Equivalent MAT Raw Scores and
                   Corresponding Grade Equivalents

Level 3           Elementary Level   Intermediate Level   Difference in
CAT Raw              (Grade 4)           (Grade 5)        GE Scores
Scores                                                    (Grade 5 minus
(Grades 4 & 5)      Raw      GE        Raw      GE        Grade 4)

     80              94      9.9        91      9.8          -0.1
     70              89      8.4        76      8.4           0.0
     60              84      6.6        63      7.4           0.8
     50              76      5.2        51      5.5           0.3
     40              63      3.7        39      4.4           0.7
     30              45      3.2        29      3.5           0.3
     20              26      2.3        20      2.6           0.3
     10              12      1.3         8      1.4           0.1






TABLE 5-2

Total Reading Equivalent Scores on the MAT Elementary
and Intermediate Levels (Grade Equivalents via CTBS)

                    Equivalent MAT Raw Scores and
                   Corresponding Grade Equivalents

Level 2           Elementary Level   Intermediate Level   Difference in
CTBS Raw             (Grade 4)           (Grade 5)        GE Scores
Scores                                                    (Grade 5 minus
(Grades 4 & 5)      Raw      GE        Raw      GE        Grade 4)

     80              93      9.8        87      9.8           0.0
     70              86      7.3        69      6.9          -0.4
     60              78      5.4        55      5.7           0.3
     50              68      4.3        44      4.9           0.6
     40              56      3.5        35      4.2           0.7
     30              41      2.9        28      3.5           0.6
     20              24      2.0        20      2.6           0.6
     10              12      1.3        10      1.6           0.3







TABLE 5-3

Total Reading Equivalent Scores on the MAT Elementary
and Intermediate Levels (Scaled Scores* via CAT)

                    Equivalent MAT Raw Scores and
                    Corresponding Scaled Scores

Level 3           Elementary Level   Intermediate Level   Difference in
CAT Raw              (Grade 4)           (Grade 5)        Scaled Scores
Scores                                                    (Grade 5 minus
(Grades 4 & 5)      Raw    Scaled      Raw    Scaled      Grade 4)

     80              94     119         91     117           -2
     70              89      94         76      91           -3
     60              84      84         63      83           -1
     50              76      75         51      77            2
     40              63      66         39      70            4
     30              45      58         29      62            4
     20              26      47         20      52            5
     10              12      26          8      29            3

*MAT Standard Scores






TABLE 5-4

Total Reading Equivalent Scores on the MAT Elementary
and Intermediate Levels (Scaled Scores* via CTBS)

                    Equivalent MAT Raw Scores and
                    Corresponding Scaled Scores

Level 2           Elementary Level   Intermediate Level   Difference in
CTBS Raw             (Grade 4)           (Grade 5)        Scaled Scores
Scores                                                    (Grade 5 minus
(Grades 4 & 5)      Raw    Scaled      Raw    Scaled      Grade 4)

     80              93     112         87     105           -2
     70              86      88         69      86           -2
     60              78      77         55      79            2
     50              68      69         44      73            4
     40              56      62         35      67            5
     30              41      56         28      61            5
     20              24      45         20      52            7
     10              12      26         10      34            8

*MAT Standard Scores




scaled scores are used, change observed on the same level of a test is
apt to yield different results than change observed over two vertically
equated levels of a test.

In Tables 5-1 through 5-4, except for very high scores, there is a
consistent tendency for the "equated" scores based on the higher level
form to be larger than their counterparts based on the lower level
form. If this were a general trend, then it might be possible to
compensate for the tendency. Unfortunately, this trend does not hold for all
test combinations.

Additional comparisons of vertically equated scores based on the
results of the Anchor Test Study are reported in Tables 5-5 through
5-8. The results in Tables 5-5 through 5-8 provide comparisons of
results for grades 5 and 6. At those grades, the same level
(Intermediate II) of the Stanford Achievement Tests, SAT, (Harcourt, Brace,
Jovanovich, 1973) was used while different levels of the CAT (Levels
3 and 4) and of the CTBS (Levels 2 and 3) were used. In Table 5-5,
selected raw scores on the SAT are reported along with equivalent CAT
Level 3 and CAT Level 4 raw scores and associated grade equivalent
scores. The differences in "equated" grade equivalent scores are also
reported in Table 5-5. A similar set of results for CTBS Level 2 and
Level 3 grade equivalent scores is reported in Table 5-7. The results
in Tables 5-6 and 5-8 were obtained in parallel fashion except that
other vertically equated scaled scores that are reported by the publisher
are used.

The results for the CAT grade equivalent scores (Table 5-5) have
a pattern just the opposite of the one previously encountered for the
MAT. That is, except for the highest scores, the higher level form
tends to yield lower grade equivalent scores than the "equated" score
of the lower level form. It should also be noted that the magnitude of
the grade equivalent score differences in Table 5-5 tends to be smaller
for scores in the middle of the range than were the differences in
Tables 5-1 or 5-2.

The results in Table 5-6 are based on the CAT Achievement
Development Scale Scores. These scores are scaled to span grades 1 to 12 with
a range of scores from 100 to 900. The mean at grade 10 is set at 600
and the standard deviation at 100. At grade 4, the mean is about 400
and the standard deviation about 85. The results in Table 5-6 are
similar to those in Table 5-5. The Achievement Development Scale Scores
are lower for Level 4 than for Level 3 except at the very high end of
the score distribution. The magnitude of the difference for the middle
range of scores is only about an eighth of a within grade standard
deviation or less.



In Table 5-7, the CTBS Level 2 and Level 3 grade equivalent scores
that correspond to common SAT scores are reported. In the middle part
of the score range, the Level 2 grade equivalents are higher than their
Level 3 counterparts and the opposite is true at both extremes of the
score distribution. The magnitude of the difference in the middle part
of the score distribution is 0.3 or 0.4 grade equivalent units. Similar




TABLE 5-5

Total Reading Equivalent Scores on the CAT
Level 3 and Level 4 (Grade Equivalents via SAT)

                    Equivalent CAT Raw Scores and
                   Corresponding Grade Equivalents

Intermediate          Level 3             Level 4         Difference in
II SAT Raw           (Grade 5)           (Grade 6)        GE Scores
Scores                                                    (Grade 6 minus
(Grades 5 & 6)      Raw      GE        Raw      GE        Grade 5)

    110              82     12.4        82     13.6           0.7
    100              80     11.4        71     11.5           0.1
     90              77     10.1        62      9.8          -0.3
     80              72      8.5        55      8.5           0.0
     70              68      7.7        48      7.5          -0.2
     60              63      7.0        42      6.8          -0.2
     50              56      6.1        36      5.9          -0.2
     40              47      5.1        29      4.9          -0.2
     30              35      3.9        22      3.5          -0.4
     20              22      2.4        16      2.2          -0.2
     10              13      1.1         9      0.6          -0.5



TABLE 5-6

Total Reading Equivalent Scores on the CAT
Level 3 and Level 4 (Scaled Scores* via SAT)

                    Equivalent CAT Raw Scores and
                    Corresponding Scaled Scores

Intermediate          Level 3             Level 4         Difference in
II SAT Raw           (Grade 5)           (Grade 6)        Scaled Scores
Scores                                                    (Grade 6 minus
(Grades 5 & 6)      Raw    Scaled      Raw    Scaled      Grade 5)

    110              82     665         82     757           92
    100              80     625         71     626            1
     90              77     580         62     566          -14
     80              72     530         55     528           -2
     70              68     503         48     497           -6
     60              63     480         42     474           -6
     50              56     454         36     450           -4
     40              47     424         29     415           -9
     30              35     380         22     364          -16
     20              22     318         16     306          -12
     10              13     259          9     232          -27

*CAT Achievement Development Scale Scores




TABLE 5-7

Total Reading Equivalent Scores on the CTBS
Level 2 and Level 3 (Grade Equivalents via SAT)

                    Equivalent CTBS Raw Scores and
                   Corresponding Grade Equivalents

Intermediate          Level 2             Level 3         Difference in
II SAT Raw           (Grade 5)           (Grade 6)        GE Scores
Scores                                                    (Grade 6 minus
(Grades 5 & 6)      Raw      GE        Raw      GE        Grade 5)

    110              85     11.9        84     12.9           1.0
    100              82     11.5        75     11.5           0.6
     90              79      9.7        66      9.4          -0.3
     80              76      8.7        59      8.3          -0.4
     70              72      7.6        52      7.3          -0.3
     60              68      6.9        45      6.5          -0.4
     50              62      6.0        37      5.6          -0.4
     40              55      5.1        30      4.7          -0.4
     30              38      3.9        22      3.6          -0.3
     20              23      2.7        16      2.5          -0.2
     10              12      1.2         9      2.0           0.8






TABLE 5-8

Total Reading Equivalent Scores on the CTBS
Level 2 and Level 3 (Scaled Scores* via SAT)

                    Equivalent CTBS Raw Scores and
                    Corresponding Scaled Scores

Intermediate          Level 2             Level 3         Difference in
II SAT Raw           (Grade 5)           (Grade 6)        Scaled Scores
Scores                                                    (Grade 6 minus
(Grades 5 & 6)      Raw    Scaled      Raw    Scaled      Grade 5)

    110              85     744         84     786           42
    100              82     660         75     641          -19
     90              79     612         66     579          -33
     80              76     554         59     543          -11
     70              72     523         52     513          -10
     60              68     497         45     483          -14
     50              62     465         37     451          -14
     40              53     433         30     421          -12
     30              38     386         22     370          -16
     20              23     325         16     314          -11
     10              12     236          9     247          -11

*CTBS Expanded Standard Scores




results are reported in Table 5-8 using the CTBS Expanded Standard
Scores, which range from 100 to 900 with a mean and standard deviation
at grade 10 of 600 and 100 respectively. The magnitude of the differences
in Table 5-8 tends to be about one fifth of the standard deviation
observed at grade 5 (which is about 72).

In summary, the results in Tables 5-1 through 5-8 raise doubts
about the adequacy of the vertical equating. Change observed on a
single level of a test is apt to have a different meaning than the same
change observed on vertically equated levels of the same test.
Unfortunately, the direction of the difference is apparently not consistent.



THE RASCH MODEL 

An important aspect of the definition of equivalent scores that was
mentioned above is that the corresponding percentile ranks be equal
for "any given group." With presently used methods of equating, this
ideal is only roughly approximated for vertically equated test forms.
This may simply be a reflection of the difficulty of the task rather
than a fault of the methods. It is possible, however, that a rather
different approach to the problem would yield better results. If
so, that would be a valuable contribution to longitudinal research
studies. An approach that appears particularly promising for the problem
of vertical equating is one based on the Rasch (1960, 1966a, 1966b)
model.

The appeal of the Rasch model is apparent in Wright's (1968)
description of the model as providing "person-free test calibration"
and "item-free person measurement." What is meant by person-free test
calibration is that the item parameters that are estimated are invariant
for all groups of persons. Item-free person measurement, on the other
hand, means that once items have been calibrated, except for errors
of measurement, the same score would be obtained for an individual
regardless of which subset of items is used for the measurement.
These properties are precisely what is needed for the vertical equating
problem.

Rasch's model is a particular instance of a latent trait model and
presumably the comments about the potential use of the model in achieving
invariant item parameters and person scores could apply to other latent
trait models. The primary potential advantage of the Rasch model is
its relative simplicity in that items are characterized by a single
parameter. This characteristic may at the same time be the primary
potential disadvantage of the model, however, if it proves inadequate
for characterizing item response data.

The Rasch model is a special case of Birnbaum's (1968) logistic
model. Three types of logistic models might be distinguished according
to the number of parameters. Birnbaum's three-parameter model assumes
that the item characteristic curve can be specified in terms of a
location parameter, an item discrimination parameter, and a parameter
allowing for a non-zero lower asymptote. In the two-parameter model,
it is assumed that only the location and discrimination parameters are
required, and in the Rasch model, it is assumed that only the location
parameter is required. Thus a natural question that needs to be addressed if the
Rasch model were to be used for the problem of vertical equating is whether one
or both of the other parameters are necessary. Regardless of the number
of parameters, all three logistic models assume that a unidimensional
trait underlies the items.
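The nesting of the three logistic models can be made concrete with a single item characteristic curve function. This is a minimal sketch: the function name and argument names are hypothetical, and the scaling constant that Birnbaum's formulation sometimes includes is omitted here by assumption.

```python
from math import exp


def item_characteristic_curve(ability, location,
                              discrimination=1.0, lower_asymptote=0.0):
    """Probability of a correct response under a logistic item model.

    Three-parameter model: all arguments free.
    Two-parameter model:   lower_asymptote fixed at 0.
    Rasch model:           lower_asymptote fixed at 0 and
                           discrimination equal for all items.
    """
    logistic = 1.0 / (1.0 + exp(-discrimination * (ability - location)))
    return lower_asymptote + (1.0 - lower_asymptote) * logistic
```

When ability equals the item location the Rasch and two-parameter curves both give a probability of 0.5; with a lower asymptote of 0.2 the same point gives 0.6, and very low ability approaches the asymptote rather than zero.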

Ignoring estimation problems, the three-parameter logistic model is
undoubtedly more adequate than the two-parameter model or the Rasch model
with only one parameter per item. Recent work by Lord (1975) suggests that
in the long run the three-parameter logistic model may prove to provide
a much improved means of vertical equating. The main disadvantages of
the approach are the demands for very large sample sizes to achieve
stable estimates and the considerable computing costs. The Rasch model
is much simpler computationally than the three-parameter logistic model,
which would be a substantial advantage if the model provides an adequate
approximation to real sets of data.

Following the notation of Wright and Panchapakesan (1969), the
Rasch model specifies that the probability of a correct response to the
$i$th item by the $n$th individual is

$$P\{a_{ni} = 1\} = \frac{Z_n E_i}{1 + Z_n E_i},$$

where $a_{ni}$ is the item score, which takes a value of 1 if the response is
correct and zero otherwise, $Z_n$ is the ability score for the $n$th person,
and $E_i$ is the item easiness. For most purposes, it is more convenient
to deal with log ability ($b_n = \log Z_n$) and log easiness ($d_i = \log E_i$),
which make it possible to express the log odds, $L_{ni}$, in the simple form

$$L_{ni} = \log \frac{P_{ni}}{1 - P_{ni}} = b_n + d_i,$$

where $P_{ni} = P\{a_{ni} = 1\}$.
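The relation between the probability form and the log odds form of the Rasch model can be checked numerically. The sketch below assumes natural logarithms; the function names are illustrative and are not taken from Wright and Panchapakesan.

```python
from math import exp, log


def rasch_probability(log_ability, log_easiness):
    """P(correct) = Z*E / (1 + Z*E), with Z = exp(b_n) and E = exp(d_i)."""
    odds = exp(log_ability + log_easiness)   # Z_n * E_i
    return odds / (1.0 + odds)


def log_odds(probability):
    """Log odds of a correct response, log(P / (1 - P))."""
    return log(probability / (1.0 - probability))
```

For any pair of values, `log_odds(rasch_probability(b, d))` returns `b + d` up to rounding error, which is the additive form that makes the model attractive for equating.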



As previously indicated, there are three assumptions of the Rasch
model that may have questionable validity for typical multiple choice
test items. That is, (1) the test may be multidimensional, (2) the
items may vary in discriminating power, and (3) there may be a non-zero
probability due to guessing of getting an item right regardless of the
ability of the examinee. Wright (1968) acknowledged these three problems
but argued that test construction should purposefully try to minimize them.







Some investigations of the robustness of the Rasch model under
violations of the assumptions of equal discriminating power and lower
asymptotes of zero have been conducted. Hambleton and Traub (1971)
generated item response data based on the Birnbaum three-parameter
logistic model. They then compared the results based on an assumed
Rasch model and an assumed Birnbaum two-parameter model to those
results based on the three parameters used to generate the data. Both
the Rasch and the two-parameter Birnbaum models became noticeably less
efficient when guessing was introduced. The two-parameter model was
generally more efficient than the Rasch model except at low ability
levels under conditions of no guessing.

One of the potential advantages of the Rasch or other latent trait
models over conventional equating procedures is the possibility that
the item parameters and therefore the test calibration are invariant.
That is, the estimates of the item parameters should not depend on the
sample used to obtain the estimates, which is what Wright (1967) refers
to as "person-free test calibration." Several studies (e.g., Anderson,
Kearney, and Everett, 1968; Tinsley and Dawis, 1975) have found that
the Rasch item parameter estimates have relatively good invariance for
particular sets of items. As might be expected, the invariance is
improved when consideration is limited to those items that are found
to fit the Rasch model within a given confidence interval.

Particularly relevant for the vertical equating problem are results
such as those reported by Wright (1968) which compare estimates of ability
based on "hard" and "easy" tests. This approach was used to investigate
the adequacy of the "item-free person measurement" claim. Using the test
responses of students to a 48-item test, separate scores were
obtained for each student based on the 24 easiest items and on the 24
hardest items. As would be expected, there was a substantial difference
in the mean raw number right scores for the easy and hard tests
(17.16 vs. 10.38 respectively). When estimated log ability scores were
obtained, the means of the two tests were quite similar (means of 0.464
and 0.403 on the easy and hard tests respectively). To make a comparison
between the difference in raw score means and the difference in log
ability means, the differences in means can be compared to the
corresponding standard deviations of the differences. For raw scores, the
mean difference is 6.78 and the standard deviation of the difference
is 3.30; thus, almost all the raw score differences are positive. For
log ability, on the other hand, the mean difference is 0.061, while the
corresponding standard deviation is 0.749. The log ability differences
are significantly greater than zero (t = 2.54) but the magnitude of the
difference is small.

Whitely and Dawis (1974) report fairly similar results for a
[...] (1972) data for 949 subjects on [...] verbal
[...] the items were divided into easy and hard subtests,
[...] in raw score on the easy and hard subtests was
[...] corresponding value for the log ability scores was
only 2.15.

Another comparison between "easy" and "hard" tests was made in both
the Wright (1968) and Whitely and Dawis (1974) studies by converting scores
to "standardized difference scores." The standard errors associated with
a given individual's ability estimate on the hard and easy tests are used
along with the two ability estimates to obtain a "standardized difference
score", $D_n$, as follows:

$$D_n = \frac{b_{ne} - b_{nh}}{\left(S_{ne}^2 + S_{nh}^2\right)^{1/2}},$$

where $b_{nh}$ and $b_{ne}$ are the log ability estimates for individual $n$ on the
hard and easy tests respectively, and $S_{nh}^2$ and $S_{ne}^2$ are the estimated
variances of the error of measurement associated with the individual's
log ability estimate on the hard and easy tests. Wright and Panchapakesan
(1969) provide an algorithm for obtaining the necessary estimated error
variances in addition to the ability and item estimates of the Rasch
model.
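A sketch of the standardized difference score follows. It assumes the ability estimates and their standard errors are already available (for example from a Rasch calibration program) and that the difference is taken as easy minus hard; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def standardized_difference(b_easy, b_hard, se_easy, se_hard):
    """(b_easy - b_hard) divided by the square root of the summed
    error variances of the two log ability estimates."""
    return (b_easy - b_hard) / sqrt(se_easy ** 2 + se_hard ** 2)


def summarize_differences(records):
    """Mean and standard deviation of D over examinees.

    records: iterable of (b_easy, b_hard, se_easy, se_hard) tuples.
    """
    d_scores = [standardized_difference(*r) for r in records]
    return mean(d_scores), pstdev(d_scores)
```

If the two subtests measure the same ability on a common scale, the summary values should be close to a mean of 0 and a standard deviation of 1.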

Using the D scores shown above, Wright (1968) computed means and
standard deviations and noted that if the log ability estimates from the
hard and easy tests were statistically equivalent, the mean should be
zero and the standard deviation 1.0. The values actually obtained by
Wright were 0.003 and 1.014 for the mean and standard deviation
respectively. This result was judged to provide strong evidence for the equivalence
of the hard and easy tests. Although the mean of 0.057 and the standard
deviation of 1.146 reported by Whitely and Dawis (1974) are not as good
as the values obtained by Wright, they do lend some support for the
item-free person measurement claim of the Rasch model.

The results obtained by Wright and by Whitely and Dawis are very
encouraging because of their potential significance for the vertical
equating problem. There remain questions, however, about the
generalizability of these results. It would be desirable to have more
information about the consistency of the relative standing of a group of
individuals on two equated tests that differ substantially in difficulty.
It would also be desirable to have information about the stability of
the results when estimates are obtained from one sample of examinees
and then applied to a different sample of examinees. Finally, it would
be helpful to have information on whether hard and easy tests are uniquely
equated if divergent groups of examinees are used to perform the equating.
Analyses of some existing item response data were undertaken in an attempt
to provide just such information.





EMPIRICAL ANALYSES USING THE RASCH MODEL

Procedure

Item response data for 1,365 students on 50 items of a retired
form of the College Entrance Board's Mathematics Achievement Test Level
I were obtained from the files of the Office of Instructional Resources,
Measurement and Research Division, of the University of Illinois. The
test was used as the intermediate mathematics proficiency and placement
examination for all 1973 incoming freshmen at the University of Illinois
who had not previously had a trigonometry course. Based on the 1,365
students, items 37-50 were discarded because of possible speededness
or because the proportion of students correctly responding to a given
item, p, was less than 0.20 or greater than 0.80. With but a few
exceptions, items 37-50 had p values less than 0.15, and the ones which
did not were very close to 0.20 and had associated proportions omitting
equal to 0.40 or greater. Of the 36 items retained, the p values ranged
from 0.22 to 0.77 except for two items which had p values of 0.82 and
0.81 with associated proportions omitting equal to 0.01 and 0.04
respectively. The p values were also used to create two subtests. An
"easy" test consisted of the 18 items with the highest p values and a
"difficult" test consisted of the 18 items with the lowest p values. In
addition to eliminating several items, any student who responded
correctly or incorrectly to all 36 items or to all items on either of
the two 18-item subtests was eliminated from all analyses. This was done
because no information can be obtained for the item analyses from students
who respond at these two extremes. A total of 58 students were eliminated.
A student could have a score of 0 or 18 on one of the subtests
but be usefully included in the total test analyses, but for simplicity
these few students were also eliminated.

The complete set of group/test combinations that were utilized in this
study is summarized in Table 5-9. Nine sets of parameters were obtained
corresponding to the crossing of the three possible tests (difficult,
easy, and total) and the three examinee groups used for estimation (high,
low, and total). As indicated in Table 5-9, these group/test combinations
will be referred to by two letters identifying the test, then the
group. For example, estimates based on the difficult test and the low
group are labelled DL. The other possible labels are specified in Table
5-9.

The division of items into easy and difficult subtests is in line with
the subtests used by Wright (1967) and one of the pairs of subtests
investigated by Whitely and Dawis (1974), and therefore some of the analyses
presented here parallel their analyses. However, in addition to using the
total sample, three subpopulations of examinees were formed according
to their "ability" level. The examinees were assigned to a "high" group
if they had 21 or more items correct on the total 36-item test. With
16 or fewer items correct, examinees were assigned to a "low" group. The
remaining examinees, who had scores between 17 and 20, were retained in
a "middle" group. This split assigned 490 examinees to the high group,
483 to the low group, and the remaining 334 to the middle group.



* We wish to thank Dr. David Frisbie for providing us with access to these
data.

Item parameter estimates for all 36 items were obtained for each of
the three groups via the Wright and Panchapakesan (1969) computer
program. These three sets of 36-item parameter estimates were then used
as the values of the item parameters for the easy and difficult tests for
each of the three appropriate groups. For example, the 18-item parameter
estimates corresponding to the 18 easiest items obtained for TT
were used for ET and the other 18-item parameter estimates were used
for DT. Ability estimates were then computed by the iterative
Newton-Raphson procedure, given that the items were already calibrated.
However, in addition to obtaining ability estimates that used previously
calibrated items, it was decided to compare these ability estimates
with ones that used no prior information for the item parameters. The
Pearson product-moment correlation between the two ability estimates
was 1.0 for the total, high and low groups. Because of these three
perfect correlations, only the results based on the ability estimates
obtained by using the previously calibrated items are reported. The
middle group was not used to obtain estimates (other than as part of the
total group) but it was used to compare the equivalence of the easy and
difficult tests by using the ability estimates based on the high group
(and also the low group) and applying them to the middle group.
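
The ability estimation step described above can be illustrated with a
minimal sketch. It assumes the standard Rasch likelihood, in which the
maximum likelihood log ability for a given raw score is the value at which
the expected number right equals the observed number right. The function
name, starting value, and stopping rule are illustrative assumptions and
are not taken from the Wright and Panchapakesan program.

```python
import math

def rasch_ability(raw_score, difficulties, tol=1e-8, max_iter=50):
    """Newton-Raphson estimate of log ability for one raw score.

    `difficulties` are item difficulties (in logits) that are treated as
    already calibrated. Returns (log ability, standard error).
    Hypothetical helper written for illustration only.
    """
    n = len(difficulties)
    if not 0 < raw_score < n:
        # Zero and perfect scores have no finite estimate, which is why
        # such examinees were dropped from the analyses.
        raise ValueError("raw score must lie strictly between 0 and n")

    def moments(theta):
        probs = [1.0 / (1.0 + math.exp(-(theta - d))) for d in difficulties]
        expected = sum(probs)                      # expected number right
        info = sum(p * (1.0 - p) for p in probs)   # test information
        return expected, info

    theta = math.log(raw_score / (n - raw_score))  # starting value
    for _ in range(max_iter):
        expected, info = moments(theta)
        step = (raw_score - expected) / info
        theta += step
        if abs(step) < tol:
            break
    _, info = moments(theta)
    return theta, 1.0 / math.sqrt(info)

if __name__ == "__main__":
    # Eighteen items of equal difficulty: a score of 12 gives log(12/6).
    print(rasch_ability(12, [0.0] * 18))
```

Because the estimate depends on the responses only through the raw score,
one conversion table from number right to log ability can be produced for
each test and each calibration group, which is what the later comparison of
high and low group conversions relies on.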

RESULTS

The results for the comparison of the difficult and easy tests
for the total sample (DT and ET) are reported in Table 5-10. These
results parallel those reported by Wright (1968) and by Whitely and
Dawis (1974). As would be expected, the means on the two tests are
quite different for the number right scores, but quite similar for the
log ability scores. A t test for the difference in means on the number
right score yields a value of 65.74, while the t for the difference in
means on the estimated log ability score is only 1.82. Furthermore,
the mean and standard deviation of the "standardized difference scores"
are near 0.0 and 1.0 respectively, as would be expected for statistically
equivalent tests. Thus, based on the total sample, the easy and
difficult tests appear to be well equated on the log ability scale.
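
The t values quoted above can be checked against the summary statistics in
Table 5-10. The sketch below assumes that each t is the mean of the
within-examinee difference scores divided by its standard error, and that
the total group contains 1,307 examinees (the sum of the high, low, and
middle group sizes given earlier). The function name is illustrative.

```python
import math

def t_from_summary(mean_diff, sd_diff, n):
    """t statistic for a mean difference score, computed from the mean and
    standard deviation of the difference scores and the sample size."""
    return mean_diff / (sd_diff / math.sqrt(n))

n_total = 1307  # 490 high + 483 low + 334 middle examinees

# Difference column of Table 5-10 (easy minus difficult).
t_number_right = t_from_summary(5.461, 3.003, n_total)
t_log_ability = t_from_summary(0.045, 0.897, n_total)

if __name__ == "__main__":
    print(round(t_number_right, 2), round(t_log_ability, 2))
```

Under these assumptions the number right value agrees with the reported
65.74, and the log ability value agrees with the reported 1.82 to within
the rounding of the tabled mean difference.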

The above comparison of easy and difficult tests was repeated for
both the high and low groups. These results are reported in Table 5-11.
The results for the estimated log ability scores for the high and low
groups are less favorable than those for the total group, but they still
provide reasonably good support for the claim that the scale provides
equivalent measurement. The main exceptions to the support for
equivalent measurements come from two sources: (1) the relatively large
mean of the standardized difference scores obtained for the high group,
and (2) the relatively large discrepancy between 1.0 and the standard
deviations of 0.932 and 1.115 obtained for the standardized difference
scores for the high and low groups respectively.

One of the requirements stated early in this section for equating is
that the conversion from raw to scale scores be unique for different
subpopulations. To investigate this assertion, the independent
conversions for the high and low groups were compared. If the log ability
estimate associated with a particular number right score for the high group
TABLE 5-10

Comparison of Difficult and Easy
Test Results for the Total Group

                                                          Standardized
Statistic        Easy Test   Difficult Test   Difference   Difference

Number Right Score
  Mean             11.975         6.514          5.461
  Std. Error        0.098         0.086          0.083
  Std. Dev.         3.539         3.127          3.003

Estimated Log Ability
  Mean              0.114         0.069          0.045        -0.023
  Std. Error        0.030         0.025          0.025         0.029
  Std. Dev.         1.090         0.903          0.897         1.039



TABLE 5-11

Comparison of Difficult and Easy
Test Results for the High and Low Groups

                                                          Standardized
Statistic        Easy Test   Difficult Test   Difference   Difference

Number Right Score (High Group)
  Mean             15.131         9.500          5.631
  Std. Error        0.063         0.108          0.119
  Std. Dev.         1.402         2.387          2.634

Number Right Score (Low Group)
  Mean              8.453         3.797          4.656
  Std. Error        0.126         0.072          0.149
  Std. Dev.         2.773         1.591          3.271

Estimated Log Ability (High Group)
  Mean              0.995         1.005         -0.010        -0.093
  Std. Error        0.030         0.028          0.037         0.042
  Std. Dev.         0.662         0.611          0.828         0.932

Estimated Log Ability (Low Group)
  Mean             -0.809        -0.832          0.023        -0.037
  Std. Error        0.034         0.029          0.045         0.051
  Std. Dev.         0.745         0.631          0.978         1.115



is plotted against the estimate for the low group, the points should
fall on a straight line through the origin with a slope of one if the
conversion is unique. The results of such a plotting of ability
estimates are given in Figures 5-1 and 5-2 for the easy and difficult
tests respectively.

Inspection of Figures 5-1 and 5-2 shows that, with the notable
exception of the lowest scores on the easy test (Figure 5-1), the
points fall very nearly on a 45-degree line through the origin. By far the
largest exception is for the lowest raw score on the easy test (Figure
5-1), where the estimated log ability based on the high group is much too
low compared to the estimated log ability based on the low group. This
exception occurs at the lowest score on the easy test, where the standard
error of estimate for the high group is very large. Thus, the exception may
not be considered very serious. In general, the results in Figures 5-1 and
5-2 are in close agreement with the results previously reported by Anderson,
et al. (1968) and by Tinsley and Dawis (1975).

The uniqueness of the equating of easy and difficult tests for
different groups may be evaluated more directly by comparing the equating
lines obtained for different groups. A linear equating of the estimated
log ability estimates based on the easy and difficult tests yields
the solid line shown in Figure 5-3 for the high group and the dashed
line for the low group. These two lines would coincide if the same
conversion applied to both groups. While the lines in Figure 5-3 are
reasonably close, there are noticeable differences at the high ability
levels. For example, an estimated log ability of 2.0 on the difficult
test would be linearly equated to an estimated log ability of about 2.1
on the easy test when the equating is based on the high group (solid line).
The comparable values when linear equating is based on the low group,
however, are 2.0 and 2.5. The reason for this discrepancy can be seen
by referring to the values of the standard deviations reported in Table
5-11. As noted in Table 5-11, the standard deviations of the log ability
scores are more discrepant from easy to difficult tests for the low
group than are the corresponding standard deviations for the high group.
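
The two equated values quoted above can be reproduced from the means and
standard deviations in Table 5-11. The sketch assumes the usual linear
equating rule, which matches standardized deviates on the two tests; the
function name is illustrative, and the high group mean on the difficult
test is read as 1.005.

```python
def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Score on test Y with the same standardized deviate as x on test X."""
    return mean_y + (sd_y / sd_x) * (x - mean_x)

# X = difficult test, Y = easy test, estimated log ability (Table 5-11).
high = linear_equate(2.0, mean_x=1.005, sd_x=0.611, mean_y=0.995, sd_y=0.662)
low = linear_equate(2.0, mean_x=-0.832, sd_x=0.631, mean_y=-0.809, sd_y=0.745)

if __name__ == "__main__":
    print(round(high, 2), round(low, 2))
```

The slope of each line is the ratio of the two standard deviations, about
1.08 for the high group and about 1.18 for the low group, so the lines
separate as the score moves away from the group means. This is the
discrepancy in standard deviations that the paragraph points to.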

The results discussed so far suggest that the Rasch model provides
at least a rough equating of the two subtests, which differ markedly in
difficulty level. Since the subtests differ more in difficulty than
would adjacent levels of a test to be vertically equated, it might still
be argued that the approach has potential value for the vertical equating
problem. It should be recalled, however, that while the easy and difficult
tests may be roughly equivalent statistically, they differ substantially
in their precision for the different levels of ability.



As a final comparison of the Rasch results for tests of different
difficulty and groups of different ability, the parameter estimates
obtained for high and low groups were applied to the examinees in the
middle group. This provides an evaluation of the adequacy of the equating
of tests of different difficulty when the estimates obtained from one
group are applied to a group with an adjacent ability level.



[Figure 5-1. Plot of Easy Test Conversions to Estimated Log Ability.
Horizontal axis: High Group Estimated Log Ability.]

[Figure 5-2. Plot of Difficult Test Conversions to Estimated Log Ability.
Horizontal axis: High Group Estimated Log Ability.]



The means, standard errors, and standard deviations for the middle
group on the easy and difficult tests are reported in Table 5-12. The
three sections of Table 5-12 provide the results for number right scores,
estimated log ability based on high group data, and estimated log ability
based on low group data. As was the case earlier, the means on the two
tests are quite different for the number right scores (t = 41.51).
However, the results based on the log ability estimates are not as
good as the corresponding results reported when ability estimates were
applied to the same group. The value of t for the difference between
means on the easy and difficult tests is -3.38 when the ability estimates
obtained from the high group were applied to the middle group. When
the ability estimates obtained from the low group were applied to the
middle group, t = 7.34. The magnitude of these differences between
means is not trivial, which leads to the following generalization. A
middle group examinee would do better to take the hard test when ability
estimates are obtained from the high group, but would do better to take
the easy test when the estimates are obtained from the low group. This
is not a very desirable feature for two tests that are to be vertically
equated. In addition, even though the standard deviations of the
standardized difference scores are near 1.0 when either type of ability
estimate is used, the means do differ significantly from 0.0 in both cases.
Clearly, the two tests cannot be regarded as statistically equivalent.
Therefore, based on the results of obtaining ability estimates from one
group and applying these same estimates to a different group, the
easy and difficult tests do not seem to provide the equivalent measurements
which are so necessary for longitudinal research.



CONCLUSIONS



Based on a logical analysis as well as the empirical comparisons of
scaled scores on different levels of standardized tests, which according
to the results of the Anchor Test Study have "equivalent" raw scores,
it must be concluded that the vertical equating of existing tests is
often less than satisfactory. Lord (1975) has suggested that among
current methods of equating, only those based on item characteristic
curve theory (i.e., latent trait models) are appropriate for the task
of vertical equating. Of these, the Rasch model is probably the
simplest. But our empirical results raise doubts about the adequacy of
this model, at least for some sets of test items.

The empirical analyses involving the Rasch model that are presented
above do not support the dual claims of item-free person measurement and
person-free test calibration. It may be that the comparisons reported
above were more extreme in terms of the wide separation of the high and
low groups than are apt to be encountered when equating tests over
adjacent grades. Also, better results might be expected by use of an
anchor test procedure. Thus, the test may be overly severe. It is also
possible that more careful selection of items that fit the model is
necessary, which is the approach that seems to be suggested by [illegible]
and [illegible] reported by Angoff (1971, pp. 529-530). Further work on
the vertical equating problem using latent trait models is clearly needed.
This should include tests of the limits of



TABLE 5-12

Comparison of Difficult and Easy
Test Results for the Middle Group

                                                          Standardized
Statistic        Easy Test   Difficult Test   Difference   Difference

Number Right Score
  Mean             12.437         6.063          6.374
  Std. Error        0.082         0.083          0.154
  Std. Dev.         1.495         1.510          2.806

Estimated Log Ability Based on High Group Data
  Mean             -0.029         0.124         -0.154        -0.276
  Std. Error        0.026         0.023          0.046         0.057
  Std. Dev.         0.474         0.420          0.835         1.040

Estimated Log Ability Based on Low Group Data
  Mean              0.233        -0.090          0.323         0.356
  Std. Error        0.023         0.023          0.044         0.054
  Std. Dev.         0.448         0.415          0.804         0.978



applicability of the Rasch model as well as investigations of models
involving more parameters. Additional work involving overlapping groups and
the use of an anchor test approach is currently underway.



REFERENCES 

Anderson, J., Kearney, G. E., & Everett, A. V. An evaluation of Rasch's
structural model for test items. The British Journal of Mathematical
and Statistical Psychology, 1968, 21, 231-238.

Angoff, W. H. Scales, norms and equivalent scores. In R. L. Thorndike
(ed.), Educational Measurement, 2nd Edition, Washington, D. C.:
American Council on Education, 1971.

Bianchini, J. C. & Loret, P. G. Anchor Test Study. Final Report. Project
Report and Volumes 1 through 33. ERIC Documents ED 092 601 through
ED 092 634, 1974.

Birnbaum, A. Some latent trait models and their use in inferring an
examinee's ability. In F. M. Lord and M. R. Novick, Statistical
Theories of Mental Test Scores, Reading, Massachusetts: Addison-
Wesley, 1968, chapters 17-20.

CTB/McGraw-Hill, California Achievement Tests, 1970 edition, Monterey,
California: CTB/McGraw-Hill, 1970.

CTB/McGraw-Hill, Comprehensive Tests of Basic Skills, 1968 edition,
Monterey, California: CTB/McGraw-Hill, 1968.

Educational Testing Service. School and College Ability Test, Princeton,
New Jersey: Educational Testing Service, 1957.

Hambleton, R. K. & Traub, R. E. Information curves and efficiency of
three logistic test models. British Journal of Mathematical and
Statistical Psychology, 1971, 24, 273-281.

Harcourt, Brace, Jovanovich. Metropolitan Achievement Tests, 1970
edition. New York: Harcourt, Brace, Jovanovich, 1970.

Harcourt, Brace, Jovanovich. Stanford Achievement Tests, 1973 edition,
New York: Harcourt, Brace, Jovanovich, 1973.

Lord, F. M. A survey of equating methods based on item characteristic
curve theory (ETS RB 75-13). Princeton, New Jersey: Educational
Testing Service, 1975.

McCarthy, P. J. Replication: An Approach to the Analysis of Data from
Complex Surveys, Washington, D. C.: National Center for Health
Statistics, Vital and Health Statistics, Series 2, No. 14, 1966.

Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests.
Copenhagen: Danish Institute for Educational Research, 1960.

Rasch, G. An individualistic approach to item analysis. In P. F.
Lazarsfeld and N. W. Henry (eds.), Readings in Mathematical Social
Science, Chicago: Science Research Associates, 1966a, pp. 89-108.

Rasch, G. An item analysis which takes individual differences into
account. British Journal of Mathematical and Statistical Psychology,
1966b, 19, 49-57.

Tinsley, H. E. & Dawis, R. V. An investigation of the Rasch simple
logistic model: sample free item and test calibration. Educational and
Psychological Measurement, 1975, 35, 325-339.

Whitely, S. E. & Dawis, R. V. The nature of objectivity with the Rasch
model. Journal of Educational Measurement, 1974, 11, 163-178.

Wright, B. D. Sample free test calibration and person measurement.
Proceedings of the 1967 Invitational Conference on Testing Problems,
Princeton, New Jersey: Educational Testing Service, 1968, pp. 85-101.

Wright, B. D. & Panchapakesan, N. A procedure for sample free item
analysis. Educational and Psychological Measurement, 1969, 29, 23-48.




CHAPTER 6

APPLICATIONS OF THE SIMPLEX MODEL IN

LONGITUDINAL STUDIES



In a variety of situations where repeated measurements are obtained
over several points in time, the intercorrelation matrix has been observed
to have particular characteristics. Typically the correlations between
measures obtained at adjacent points in time are found to be higher
than the correlations between measures that are further apart in time.
This pattern of correlations is also characteristic of Guttman's simplex
(1955), and a number of authors have suggested that the simplex is a
good model for explaining change over time (e.g., Humphreys, 1960, 1968;
Jones, 1962).



One of the difficulties that investigators have had in evaluating
the adequacy of the simplex model for a set of correlational data is
that the correlations are attenuated due to errors of measurement.
While the simplex model may be appropriate for error free measures, the
fit to correlations of fallible measures may be poor due to the errors
of measurement. Humphreys (1960) recognized this problem and tried
to deal with it by estimating reliability coefficients.

Another difficulty in evaluating the fit of a simplex model to
a set of empirical data is, of course, sampling error. Joreskog
(1970) developed estimation techniques for a variety of simplex models
including the model most commonly postulated for growth data, which he
refers to as a quasi-Markov simplex. For example, the quasi-Markov
simplex corresponds to the one suggested by Humphreys (1960). Joreskog's
estimation procedures (e.g., Joreskog, Gruvaeus, and van Thillo, 1970;
Joreskog and van Thillo, 1972) provide maximum likelihood estimates which
allow for errors of measurement and yield large sample chi-square
tests based on an assumption of multivariate normality.

Recently Werts, Linn and Joreskog (in press, a) have shown that the
simplex model provides a reasonably good fit to the intercorrelations
of achievement test results reported by Bracht and Hopkins (1972).
Those data were obtained on a yearly basis over grades 1 through 9.
Werts, Linn and Joreskog (in press, b) have also used the simplex model
to analyze the intercorrelations of grades in college over 8 semesters
that were reported by Humphreys (1968). Their reanalysis confirmed
Humphreys' assertion that the data fit a simplex model. Humphreys'
belief that the reliabilities of grades across semesters were equal was
also supported by the analyses.

In this chapter the simplex model will be briefly reviewed within the
context of longitudinal studies. Procedures for estimating model
parameters as well as correlations of gain with status at an earlier point
in time will be discussed. Finally, the results of application of the
simplex model to several sets of longitudinal data will be reported.

THE MODEL

The simplex model can be represented in several ways (see for
example Corballis, 1965; Joreskog, 1970). A conceptually appealing
form for growth data, however, is to assume that, in the absence of
errors of measurement, a score at time t + 1 is a function of the score
at time t plus an uncorrelated increment. More specifically, a person's
true score at time t + 1, $Z_{t+1}$, is assumed to be

$$Z_{t+1} = b_t Z_t + u_{t+1} \tag{6.1}$$

where $u_{t+1}$ is assumed to be uncorrelated with $Z_t$. If, for
convenience, the Z's are all standardized, then the correlation between
$Z_t$ and $Z_{t+1}$ is simply $b_t$, and the correlation between $Z_i$ and
$Z_j$ (i > j) is the product $b_j b_{j+1} \cdots b_{i-1}$.

It should be noted that the assumption that $Z_t$ and $u_{t+1}$ are
uncorrelated does not imply that growth is uncorrelated with previous
status, as has sometimes been assumed. Still dealing with the error free
measures, $Z_t$, the usual definition of growth from time t to time t + 1 is

$$\Delta_{t+1} = Z_{t+1} - Z_t \tag{6.2}$$

where $\Delta_{t+1}$ is the change or "growth". The change in equation
(6.2), $\Delta_{t+1}$, expressed in terms of the components of equation
(6.1), follows:

$$\Delta_{t+1} = (b_t - 1) Z_t + u_{t+1} \tag{6.3}$$

From equation (6.3), and because $u_{t+1}$ is uncorrelated with $Z_t$, the
covariance of growth with status at time t is

$$\sigma(\Delta_{t+1}, Z_t) = (b_t - 1)\,\sigma^2(Z_t) \tag{6.4}$$

where $\sigma^2(Z_t)$ is the variance of $Z_t$.

From equation (6.4) it is clear that the correlation between
status at time t and growth will be zero only when $b_t = 1$. Typically,
$b_t$ will not equal 1.0; hence the correlation between status at time
t and growth will be non-zero.


The fallible observed measures are assumed to follow a classical
test theory model at any point in time. Thus, an observed score, $X_t$,
at time t may be represented by

$$X_t = Z_t + e_t \tag{6.5}$$

where $e_t$ is assumed to have an expected value of zero and to be
uncorrelated with Z at all points in time and uncorrelated with e at points
in time other than t. The model may be depicted by a path analysis
diagram as shown in Figure 6-1.

The usual observed gain score, $D_{t+1}$, is simply the difference
between the observed score at time t + 1 and the observed score at time t.
Thus,

$$D_{t+1} = X_{t+1} - X_t \tag{6.6}$$

which in terms of the true change, $\Delta_{t+1}$, and errors of
measurement is

$$D_{t+1} = \Delta_{t+1} + e_{t+1} - e_t \tag{6.7}$$

Equations (6.6) and (6.7) are the standard equations for a simple gain
score expressed respectively in terms of observed scores and in terms of
true gain and errors of measurement. As such, equations (6.6) and (6.7)
are independent of the assumed underlying simplex model on the error free
measures. The relationship of $D_{t+1}$ to the parameters of the simplex
model can be seen by substituting equation (6.3) into equation (6.7).

MATRIX FORMULATION

The model as outlined above implies a particular structure for the
observed score variances and covariances. This structure is most
conveniently represented in matrix form (see for example Joreskog, 1970;
Werts, Linn & Joreskog, in press, a). Let

$$X' = (X_1, X_2, \ldots, X_p)$$

be a row vector of observed scores at p points in time, and let
$Z' = (Z_1, Z_2, \ldots, Z_p)$ be the corresponding row vector of true
scores.

In order to relate the observed scores to the parameters of the
simplex model let

$$U' = (u_1, u_2, \ldots, u_p)$$

be a row vector of the uncorrelated increments and let B be a p x p
matrix, with unities down the main diagonal, with elements $-b_1$, $-b_2$,
..., $-b_{p-1}$ next to the main diagonal on the lower left hand side, and
zeros elsewhere. For example, with p = 5, the B matrix is

$$B = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
-b_1 & 1 & 0 & 0 & 0 \\
0 & -b_2 & 1 & 0 & 0 \\
0 & 0 & -b_3 & 1 & 0 \\
0 & 0 & 0 & -b_4 & 1
\end{pmatrix}$$

With these definitions and equation (6.1) the relationship of Z and
U is given by

$$B Z = U \tag{6.8}$$

assuming $Z_0 = 0$. The simplex model on the error free parameters can
now be written as

$$Z = B^{-1} U \tag{6.9}$$

Since $B^{-1}$ is a lower triangular matrix with entries as illustrated
below for the case of p = 5,

$$B^{-1} = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 \\
b_1 & 1 & 0 & 0 & 0 \\
b_1 b_2 & b_2 & 1 & 0 & 0 \\
b_1 b_2 b_3 & b_2 b_3 & b_3 & 1 & 0 \\
b_1 b_2 b_3 b_4 & b_2 b_3 b_4 & b_3 b_4 & b_4 & 1
\end{pmatrix}$$

it can be seen that this formulation is equivalent to setting
$Z_1 = u_1$, but except for this additional specificity equations (6.1) and
(6.9) are equivalent.

The variance covariance matrix among the p observed variables, $\Sigma$,
can now be specified in terms of the parameters of the simplex model and
the variances of the errors of measurement:

$$\Sigma = B^{-1} \Psi B'^{-1} + \Theta^2 \tag{6.10}$$

where $\Psi$ is a diagonal matrix with the variances of the $u_t$ as
entries, $\sigma^2(u_t)$, and $\Theta^2$ is a diagonal matrix with the
variances of the errors of measurement as entries, $\sigma^2(e_t)$.
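
The covariance structure described above can be made concrete with a small
numerical sketch. It builds the inverse of B from the growth coefficients
and then forms the implied observed covariance matrix from the increment
variances and error variances. The function names and the numerical values
are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def b_inverse(b):
    """Lower triangular inverse of B: entry (i, j) for i > j is the product
    b[j] * b[j+1] * ... * b[i-1]; diagonal entries are 1."""
    p = len(b) + 1
    inv = [[0.0] * p for _ in range(p)]
    for i in range(p):
        inv[i][i] = 1.0
        for j in range(i - 1, -1, -1):
            inv[i][j] = inv[i][j + 1] * b[j]
    return inv

def implied_covariance(b, increment_var, error_var):
    """Observed-score covariance matrix implied by the simplex model:
    B^-1 * diag(increment_var) * (B^-1)' + diag(error_var)."""
    inv = b_inverse(b)
    p = len(increment_var)
    sigma = [[sum(inv[i][k] * increment_var[k] * inv[j][k] for k in range(p))
              for j in range(p)] for i in range(p)]
    for t in range(p):
        sigma[t][t] += error_var[t]
    return sigma

if __name__ == "__main__":
    # Hypothetical three-occasion example with unit true-score variances.
    sigma = implied_covariance(b=[0.9, 0.8],
                               increment_var=[1.0, 0.19, 0.36],
                               error_var=[0.25, 0.25, 0.25])
    for row in sigma:
        print([round(v, 3) for v in row])
```

In this example the true-score variance is 1.0 at every occasion, so the
off-diagonal entries are 0.9, 0.8, and their product 0.72, while each
diagonal entry is 1.25 once the error variance is added. Errors of
measurement therefore change only the diagonal, which is why the observed
correlations are attenuated relative to the error free simplex.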

ESTIMATES

Estimates of matrices involved in (6.10) will be denoted by a hat
over the corresponding population matrix in (6.10). Thus,

$$\hat{\Sigma} = \hat{B}^{-1} \hat{\Psi} \hat{B}'^{-1} + \hat{\Theta}^2 \tag{6.11}$$

Unfortunately, several of the elements of the three matrices on the right
hand side of (6.11) are not identified (Joreskog, 1970). To achieve
identification some additional restrictions are required. One possibility
is to arbitrarily assign fixed values to $\sigma^2(e_1)$ and
$\sigma^2(e_p)$. When this was done by Werts, Linn, and Joreskog (in press,
a) the parameter estimates for the remaining elements provided a good fit
to the observed variance covariance matrix.

An alternative approach to obtaining unique estimates is to add a
restriction to the model that the variances of the errors of measurement
are constant over time. That is, it is assumed that $\sigma^2(e_t)$ equals
$\sigma^2(e)$ for all t. With this assumption, maximum likelihood estimates
of the $b_t$, the $\sigma^2(u_t)$ and of $\sigma^2(e)$ may be obtained
using the ACOVS program (Joreskog, Gruvaeus, and van Thillo, 1970). Also
obtained is a chi-square test of the model based on an assumption of
multivariate normality. With this formulation there are p(p + 1)/2 unique
elements in $\Sigma$ and 2p parameters to be estimated (i.e., p - 1 values
of the $b_t$, p values of $\sigma^2(u_t)$, and one value of
$\sigma^2(e)$). This leaves $(p^2 - 3p)/2$ degrees of freedom for the
chi-square test.
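
The degrees of freedom count can be verified with a short calculation. The
sketch below simply subtracts the number of free parameters from the number
of distinct variances and covariances; the function name is illustrative.

```python
def simplex_degrees_of_freedom(p):
    """Degrees of freedom for the equal-error-variance simplex model
    fitted to p occasions."""
    unique_elements = p * (p + 1) // 2   # distinct variances and covariances
    parameters = (p - 1) + p + 1         # b's, increment variances, error variance
    return unique_elements - parameters

if __name__ == "__main__":
    print(simplex_degrees_of_freedom(9))
```

With nine occasions (high school rank plus eight college semesters, as in
Table 6-1) the count is 45 - 18 = 27, which is the number of degrees of
freedom reported with the chi-square in that table.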



With large samples, the chi-square test will often be of less interest
than the magnitude of the discrepancies between the variance-covariance
matrix implied by the parameter estimates of the model, $\hat{\Sigma}$, and
the observed sample variance-covariance matrix, S. With variables that have
arbitrary variances, as is frequently the case in the social sciences, the
sample correlation matrix, R, and the corresponding matrix implied by the
model, $\hat{R}$, will often be of greater interest. The residual matrix is
simply the difference between the observed correlation matrix, R, and
the estimate of the observed correlation matrix, $\hat{R}$, that is implied
by the model parameter estimates. With large sample sizes the residual
matrix is of special interest since the chi-square test will typically
lead to a rejection of the model. A significant chi-square is to be
expected for any a priori model such as the above given a sufficiently
large sample size. For evaluating the adequacy of the model it is
important also to consider the magnitude of the deviations from the model.
The residual matrix provides this information. If a single index of
fit is desired, the root mean square of the residuals is sometimes useful
(see for example, Linn and Werts, in press).
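
A root mean square residual of the kind mentioned above can be computed as
in the following sketch. Whether the diagonal elements are counted is not
stated in the text, so the function takes that choice as an explicit
argument; the function name and the small matrices are hypothetical.

```python
import math

def rms_residual(observed, implied, include_diagonal=False):
    """Root mean square of the residuals (observed minus implied) over the
    lower triangle of two symmetric matrices."""
    total, count = 0.0, 0
    for i in range(len(observed)):
        for j in range(i + 1):
            if i == j and not include_diagonal:
                continue
            total += (observed[i][j] - implied[i][j]) ** 2
            count += 1
    return math.sqrt(total / count)

if __name__ == "__main__":
    # Hypothetical 3 x 3 correlation matrices, for illustration only.
    r_obs = [[1.0, 0.50, 0.30], [0.50, 1.0, 0.40], [0.30, 0.40, 1.0]]
    r_hat = [[1.0, 0.48, 0.33], [0.48, 1.0, 0.40], [0.33, 0.40, 1.0]]
    print(rms_residual(r_obs, r_hat))
```

Unlike the chi-square statistic, this index does not grow with sample size,
which is why it is more informative about the size of the misfit when the
sample is large.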

GROWTH STATISTICS

If it is decided that the fit of the data to the model is adequate,
the parameter estimates may be used to estimate a variety of statistics
that are ordinarily considered to be of interest in longitudinal studies.
For example, the estimated correlation between true change from time
t to time t + 1 with status at time t is

$$\hat{\rho}(\Delta_{t+1}, Z_t) = \frac{(\hat{b}_t - 1)\,\hat{\sigma}(Z_t)}{\hat{\sigma}(\Delta_{t+1})} \tag{6.12}$$

where $\hat{\sigma}(\Delta_{t+1})$ is the estimated standard deviation of
true change, which is given by

$$\hat{\sigma}(\Delta_{t+1}) = \left[ (\hat{b}_t - 1)^2\,\hat{\sigma}^2(Z_t) + \hat{\sigma}^2(u_{t+1}) \right]^{1/2}$$

The estimated reliability of the simple gain scores is

$$\hat{\rho}^2(\Delta_{t+1}, D_{t+1}) = \frac{\hat{\sigma}^2(\Delta_{t+1})}{\hat{\sigma}^2(\Delta_{t+1}) + 2\hat{\sigma}^2(e)} \tag{6.13}$$

A potential advantage of formulas such as (6.12) and (6.13) over the
traditional estimates is that they are based on all data points rather than
just two points in time. This is only an advantage, however, to the degree
that the model is adequate for the data.

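Estimated covariances or correlations between true status at any two
points in time, say t and t + k, may be obtained from the model parameter
estimates as follows:

$$\hat{\sigma}(Z_t, Z_{t+k}) = \hat{b}_t \hat{b}_{t+1} \cdots \hat{b}_{t+k-1}\,\hat{\sigma}^2(Z_t)$$

and

$$\hat{\rho}(Z_t, Z_{t+k}) = \frac{\hat{\sigma}(Z_t, Z_{t+k})}{\hat{\sigma}(Z_t)\,\hat{\sigma}(Z_{t+k})}$$

The covariance of $Z_{t+k}$ and $Z_t$ along with the variance of $Z_t$ can
in turn be used to estimate the covariance of the true change from time t
to t + k, $\Delta_{t+k}$, and initial status:

$$\hat{\sigma}(\Delta_{t+k}, Z_t) = \hat{\sigma}(Z_{t+k}, Z_t) - \hat{\sigma}^2(Z_t)$$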

If there were no errors of measurement the measures at time t and
time t + 1 would contain all the information about $\Delta_{t+1}$. With
errors of measurement, however, the observed scores at times other than t
and t + 1 may contribute to the prediction of $\Delta_{t+1}$. Thus, if
there were an interest in obtaining estimated true gains between t and
t + 1, then all the observed scores $X_1, X_2, \ldots, X_p$ might be used
as predictors, as is implied by Cronbach and Furby (1970) and Werts,
Joreskog and Linn (1972). Estimated covariances of observed scores with the
true change may be obtained using the model parameters. These covariances
along with the observed score variance-covariance matrix could then be used
to obtain multiple regression estimates of $\Delta_{t+1}$. The resulting
estimate would have to be at least as good as the more natural estimate
obtained from $X_t$ and $X_{t+1}$ alone. This is of little comfort,
however, because, as shown by Tatsuoka (1975), the multiple regression
estimate of $\Delta_{t+1}$ based on $X_1, X_2, \ldots, X_p$ will be better
than the one based on $X_t$ and $X_{t+1}$ only if the errors of measurement
are correlated, which, of course, violates the assumptions of the model.



Table 6-1

High School Rank and Grade Point Averages for
Eight Semesters of College

a. Intercorrelations

Semester    HS      1      2      3      4      5      6      7      8

   HS     1.000
   1       .387  1.000
   2       .341   .556  1.000
   3       .278   .456   .490  1.000
   4       .270   .439   .445   .562  1.000
   5       .240   .399   .418   [?]    .512  1.000
   6       .256   .415   .383   .456   .469   .551  1.000
   7       .240   .387   .364   .445   .442   .500   .544  1.000
   8       .222   .342   .339   .345   .416   .453   .482   .541  1.000

b. Residuals (R - R̂)



HS 


.000 




1 


.013 


r-fc009 


2 


-.010 


.001 


3 


-.016 


-.008 


4 


-.012 


-.006 


5 


-.012 


.001 


6 


.019 


.041 


7 


.021 


' .041 


8 ' 


. .024 


-.029 



.004 



.004 - 

-.006 - 
.003 

.013 - 



.004 








.007 


.009 






.001 


-.001 


.001 




.010 


-.013 


.005 


.000 


.013 


-.005 


-.0106 


.004 


.045 


.013 


-.004 


-.005 



.0P6 ,.000 



                        Table 6-1 (Continued)

                       c. Parameter Estimates

Semester     Beta    Var(u)
   HS         --      .583
   1         .642     .350
   2         .939     .057
   3         .836     .175
   4         .958     .041
   5         .894     .122
   6         .940     .070
   7         .926     .092
   8         .904     .100

Var(e) = .417
Chi-Square = 40.07 with 27 d.f. (p = .051)



Example of Fit: Academic Achievement

Humphreys (1968) observed that the intercorrelations of high
school grades and grades in eight semesters of college followed a
pattern typical of a simplex. Werts, Linn, and Joreskog (in press, b)
reanalyzed Humphreys' data using a simplex model and found a good fit.
For illustrative purposes another analysis of these data, which are based
on a sample of approximately 1,600 students, is reported below. The
model differs slightly from that used by Werts, Joreskog, and Linn.

The specific model used with these data is the same as equation (6.11)
except that the procedure used the sample correlation matrix rather
than a variance-covariance matrix. The restriction that the variances of
the errors of estimate are equal was used. A total of 9 variables (high
school grades plus 8 semesters of college grades) were used in the analysis.

The observed correlation matrix, R, is reported in section "a" of
Table 6-1. As can be seen, there is a clear tendency for the correlations
among adjacent semesters (entries next to the main diagonal) to be higher
than the correlations between grades in more distant semesters. There are
some reversals in the pattern, but generally the correlations get smaller
as you move down a column, from right to left in a row, or from the main
diagonal to the lower left hand corner of the triangular section of the
correlation matrix shown in Table 6-1a.

Based on the observation of the correlation pattern a reasonably
good fit to the simplex model might be expected. That this is the case
is supported by the chi-square value of 40.07 which with 27 degrees of
freedom has an associated p of approximately .051. While almost
significant at the .05 level, with such a large sample size this would
appear to be a quite good fit. Further support for the goodness of fit
can be obtained from an inspection of section "b" of Table 6-1, which
lists the residual elements (i.e., R - R̂). None of the 45 residuals in
Table 6-1b exceeds .05 in absolute value and the root mean square of the
residuals is only .015. Thus, these data fit the simplex model quite well
even with the added assumption that the error variances are equal at all
nine observation points.
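The p values quoted for the chi-square tests in this chapter can be checked with a short calculation. The sketch below is not part of the original analysis; the function name `chi2_sf` is hypothetical, and it uses the standard finite-series expressions for the upper tail of a chi-square distribution with integer degrees of freedom so that no statistical library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(chi-square with integer `df` d.f. > x)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) times a finite sum of (x/2)^r / r!.
        term = 1.0
        total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd d.f.: two-sided normal tail plus a finite series in chi = sqrt(x).
    chi = math.sqrt(x)
    term = chi  # chi^(2r-1) / (1*3*...*(2r-1)) for r = 1
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    normal_tail = math.erfc(chi / math.sqrt(2.0))
    series = math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * total
    return normal_tail + series


if __name__ == "__main__":
    # Chi-square values and degrees of freedom as reported in the tables.
    for label, x, df in [("grades (Table 6-1)", 40.07, 27),
                         ("vocabulary (Table 6-4)", 17.76, 5),
                         ("mental age (Table 6-5)", 200.98, 35)]:
        print(f"{label}: chi-square = {x}, df = {df}, p = {chi2_sf(x, df):.4f}")
```

For the grade data the result falls just above .05, in line with the "almost significant" description in the text.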



The estimated correlations between true status at time t and true
change from time t to time t + 1 are reported in Table 6-2. Also
reported in Table 6-2 are the estimated reliabilities of the observed
difference scores for each time interval. All of the correlations of
true status with true change are negative. It should be noted, however,
that this result is a consequence of two features of this particular
analysis: (1) using standardized observed scores (i.e., a correlation
rather than a variance-covariance matrix) and (2) restricting the error
variances to be equal. Under these conditions the estimated variances
of the true scores will be nearly equal and the value of $b_t$ will be less
than 1.0, which yields a negative correlation between true status and
true change (see equation 6.12).
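The quantities in Table 6-2 can be recomputed from the parameter estimates in section "c" of Table 6-1. The sketch below is illustrative only. It assumes the simplex model has the form $X_t = Z_t + e_t$, $Z_{t+1} = \beta_{t+1} Z_t + u_{t+1}$ with equal error variances (equation 6.11 is not shown in this section), it reads Var(u) for semester 8 as .100, and the function names are hypothetical.

```python
import math

# Parameter estimates as read from Table 6-1, section c.
# Time 1 is high school rank; times 2 through 9 are college semesters 1 through 8.
betas = [0.642, 0.939, 0.836, 0.958, 0.894, 0.940, 0.926, 0.904]
var_u = [0.583, 0.350, 0.057, 0.175, 0.041, 0.122, 0.070, 0.092, 0.100]
var_e = 0.417


def true_score_variances(betas, var_u):
    """Var(Z_t) for each occasion, built up recursively from the first occasion."""
    variances = [var_u[0]]
    for beta, vu in zip(betas, var_u[1:]):
        variances.append(beta ** 2 * variances[-1] + vu)
    return variances


def change_statistics(betas, var_u, var_e):
    """Return (from_time, to_time, corr(true status, true change), reliability of change)."""
    var_z = true_score_variances(betas, var_u)
    rows = []
    for t, beta in enumerate(betas):
        vu = var_u[t + 1]
        # True change is Z_{t+1} - Z_t = (beta - 1) Z_t + u_{t+1}.
        cov_change_status = (beta - 1.0) * var_z[t]
        var_change = (beta - 1.0) ** 2 * var_z[t] + vu
        corr = cov_change_status / math.sqrt(var_z[t] * var_change)
        # Observed change adds two independent errors of measurement.
        reliability = var_change / (var_change + 2.0 * var_e)
        rows.append((t + 1, t + 2, corr, reliability))
    return rows


if __name__ == "__main__":
    for start, end, corr, rel in change_statistics(betas, var_u, var_e):
        print(f"{start} to {end}: correlation = {corr:.2f}, reliability = {rel:.2f}")
```

Because every weight is below 1.0, every computed correlation is negative, which is the point made in the paragraph above.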



                              Table 6-2

        Estimated Correlations of True Change with Previous
           Status and Reliability of Change (Grade Data)

   Time Interval      Correlation of True Change      Reliability
     of Change          with Previous Status           of Change
      1 to 2                    --                        .34
      2 to 3                   -.19                       .07
      3 to 4                   -.29                       .19
      4 to 5                   -.16                       .05
      5 to 6                   -.22                       .13
      6 to 7                   -.17                       .08
      7 to 8                   -.18                       .10
      8 to 9                   -.23                       .11

The reliabilities of the change scores reported in Table 6-2 are all
quite low. As would be expected, the reliability of the change is
highest for time 1 (high school rank) to time 2 (first semester college
grades), which has the lowest correlation between adjacent times. The
sawtooth pattern of the reliabilities for changes between adjacent
semesters in college is relatively consistent with the pattern of same
versus different academic years for the adjacent semesters. The
reliability of change from one semester to another tends to be slightly
lower if the two semesters are in the same academic year than if they
involve two academic years. This corresponds to a tendency for grades
in adjacent semesters in a single academic year to correlate somewhat
higher than those involving different academic years. The most notable
feature of these reliabilities, however, is their extremely low
magnitude.

Another set of academic achievement data that illustrates the use
of the simplex model was originally reported by Bracht and Hopkins
(1972). Their data consisted of achievement test scores obtained at
eight points in time (grades 1, 2, 3, 4, 5, 6, 7, and 9). The scores
were reported in grade equivalent units. Thus, the scores at least
have the superficial appearance of a common scale.

A previous attempt to fit these data to a simplex model (Werts,
Linn, and Joreskog, in press, a) resulted in a significant chi-square
with p = .035. Due to the relatively large sample size (over 300)
the significant chi-square is probably of less interest than the
magnitude of the residuals. Based on the residuals and the root mean
square of the residuals, however, the fit was judged to be reasonably
good.

Since the detailed analysis of the Bracht and Hopkins data will
be reported elsewhere (Werts, Linn, & Joreskog, in press, a), the
results will not be repeated here. One aspect of the results that stands
in sharp contrast to the above results for college grades is worthy of
special note, however. The correlations of true status with true gain
and the reliabilities of the gains were quite different in the Bracht
and Hopkins data than they were in Humphreys' grade data. These
correlations and reliabilities are reported in Table 6-3. As can be
seen in Table 6-3, the correlations between true status and true gain
are positive in all cases, which contrasts with the negative
correlations reported in Table 6-2. Also, the reliabilities of the
difference scores reported in Table 6-3 are higher than the ones
reported in Table 6-2.

As previously noted, the negative correlations of status and
change reported in Table 6-2 are a result of analyzing correlations
rather than covariances and of restrictions of the model. Since the
variance-covariance matrix was analyzed for the results in Table 6-3,
the estimated correlations might be either positive or negative.
The fact that they are all positive is a result of a particular
property of the grade equivalent scale which was discussed in Chapter 4
of this report. That is, the variance of the grade equivalent scale
increases with grade level. This increase in variance with grade level
not only results in positive correlations between observed initial
status and observed change but also between true initial status and true
change. Whether substantive meaning should be attached to these
positive correlations depends on one's view of the meaningfulness of
increased variance with grade level.

                              Table 6-3

        Estimated Correlations of True Change with Previous
      Status and Reliability of Change (Bracht and Hopkins Data)

   Time Interval      Correlation of True Change      Reliability
     of Change          with Previous Status           of Change
      2 to 3                    .67                       .42
      3 to 4                    .12                       .56
      4 to 5                    .59                       .39
      5 to 6                    .09                       .51
      6 to 7                    .22                       .43

The higher reliabilities of the change scores in Table 6-3 than
in Table 6-2 are primarily due to the higher reliabilities of the
achievement tests than of the grades. The achievement test
reliabilities are in the .80's and .90's whereas the assumed common
reliability of grades is estimated to be only .58.

At least for the two examples mentioned above, the simplex model
appears to yield estimates that fit the observed data reasonably well.
When this is true, the model has the advantage of requiring only a
single measure of a construct at each point in time. Alternative
models which are considered elsewhere in this report generally require
multiple measures at each point in time. As will be seen below, the
simplex model, at least in the simple form used to analyze the data,
proves to be relatively good for some sets of longitudinal data but
relatively poor for others.




Although the distinction between aptitude and achievement is one
more of degree than of kind, it remains of interest to test the fit
of the simplex model for tests that are closer to the basic aptitude
end of the continuum than the achievement end. Aptitude tests may
be distinguished from achievement tests primarily in terms of breadth
of relevant experience and recency of learning, with measures at the
achievement end of the continuum being narrower and more recent
(Humphreys, 1973). There is no good basis for postulating that
aptitude is fixed. Indeed, as implied by Anderson (1939) and more
formally specified by Humphreys (1960), there is reason to believe that
the simplex model might be quite appropriate for aptitude measures. An
attempt was made to fit two sets of data involving ability measures at
the aptitude end of the continuum to the simplex model. The matrices
of intercorrelations for both sets of data were obtained from
Humphreys (1967).

The first set of data involves vocabulary test scores for 278
children obtained yearly from grades 2 through 6. The intercorrelations
among the vocabulary scores over these five points in time are reported
in section "a" of Table 6-4. Inspection of the correlation matrix
suggests that the simplex model may not be very adequate for these data.
This is suggested by a number of instances where the correlations between
scores obtained for grades separated by more time are as high or higher
than those obtained for grades that are separated by less time.



                              Table 6-4

               Vocabulary Scores from Grade to Grade
                             (N = 278)

                        a. Intercorrelations

   Grade       2       3       4       5       6
     2       1.00
     3        .65    1.00
     4        .58     .65    1.00
     5        .63     .73     .72    1.00
     6        .56     .68     .65     .76    1.00

                        b. Residuals (R - R̂)

   Grade       2       3       4       5       6
     2       .000
     3       .005   -.005
     4      -.010   -.027    .044
     5       .018    .027    .001   -.039
     6      -.021    .013   -.032    .024     --

                       c. Parameter Estimates

   Grade     Beta    Var(u)
     2        --      .737
     3       .876     .176
     4       .913     .074
     5      1.039     .030
     6       .948     .038

Var(e) = .263
Chi-Square = 17.76 with 5 df (p = .003)



The parameter estimates for the simplex model are reported in
section "c" of Table 6-4 along with the chi-square test. The residuals
(i.e., R - R̂) are reported in section "b" of Table 6-4. The chi-square
value is significant at the .01 level, which suggests that the model
may not be adequate for these data. Given the relatively large sample
size, however, it may still be of interest to consider the residuals.
All of the residuals are less than .05 and the root mean square of
the residuals is .023. Thus, the model provides a reasonably good
fit to the data although the model can be confidently rejected
statistically.
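The residuals and their root mean square can be reproduced from the reported parameter estimates. The sketch below is an illustration under the assumption that the model is $X_t = Z_t + e_t$, $Z_{t+1} = \beta_{t+1} Z_t + u_{t+1}$ with a common error variance; the function names are hypothetical, and small discrepancies from the printed residuals are expected because the parameter estimates are rounded to three decimals.

```python
import math

# Values as read from Table 6-4 (vocabulary scores, grades 2 through 6).
observed = [
    [1.00],
    [0.65, 1.00],
    [0.58, 0.65, 1.00],
    [0.63, 0.73, 0.72, 1.00],
    [0.56, 0.68, 0.65, 0.76, 1.00],
]
betas = [0.876, 0.913, 1.039, 0.948]        # weights for grades 3, 4, 5, 6
var_u = [0.737, 0.176, 0.074, 0.030, 0.038]  # first entry is Var(Z) at grade 2
var_e = 0.263


def implied_matrix(betas, var_u, var_e):
    """Lower triangle of the model-implied matrix for a simplex with common error variance."""
    var_z = [var_u[0]]
    for beta, vu in zip(betas, var_u[1:]):
        var_z.append(beta ** 2 * var_z[-1] + vu)
    rows = []
    for i in range(len(var_u)):
        row = []
        for j in range(i + 1):
            if i == j:
                row.append(var_z[i] + var_e)
            else:
                # cov(X_j, X_i) for j < i: product of the weights linking j to i, times Var(Z_j).
                path = 1.0
                for beta in betas[j:i]:
                    path *= beta
                row.append(path * var_z[j])
        rows.append(row)
    return rows


def residual_summary(observed, implied):
    """Return the residual triangle (observed minus implied) and its root mean square."""
    residuals = [[o - m for o, m in zip(obs_row, imp_row)]
                 for obs_row, imp_row in zip(observed, implied)]
    flat = [value for row in residuals for value in row]
    rms = math.sqrt(sum(value ** 2 for value in flat) / len(flat))
    return residuals, rms


if __name__ == "__main__":
    residuals, rms = residual_summary(observed, implied_matrix(betas, var_u, var_e))
    for grade, row in zip(range(2, 7), residuals):
        print(grade, " ".join(f"{value:6.3f}" for value in row))
    print(f"root mean square residual = {rms:.3f}")
```

The same two functions apply to any of the data sets in this chapter once the observed matrix and the parameter estimates are supplied.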

One possible difficulty with the model in this particular instance
is the assumption that the variance of the errors of measurement is
constant across time. Judging from the correlations among adjacent
grades and the general tendency for measures to be less reliable at
the early grades than at the higher grades, one might suspect that
$\sigma^2(e)$ should be less at grades 4, 5 and 6 than at grades 2 and 3.
This problem may contribute to the relatively large residuals in the
diagonal at grades 4 and 5.

The second set of data is based on intelligence test scores obtained
at 10 points in time for boys at ages 8 through 17. The interval
between testings was one year. The correlations, which were obtained from
Humphreys (1967), were based on data originally collected as part of the
Harvard Growth Study. The scores that were intercorrelated are mental
age scores. These correlations are reported in section "a" of Table 6-5.
Residuals of observed correlations minus correlations estimated from
the model are reported in Table 6-5 section "b", and the parameter
estimates and chi-square test are reported in Table 6-5 section "c".

The chi-square is again significant. An inspection of the matrix
of residuals, however, reveals that the fit is reasonably good with
several notable exceptions. The root mean square of the residuals is
.035, the largest encountered so far. The magnitude of the root mean
square is substantially influenced by a few large residuals. The
four largest residuals all involve correlations with scores obtained
at age 8. Removing the scores obtained at age 8 would greatly improve
the fit. For example, if the age 8 scores were deleted and the remaining
variables had the same values of R, the root mean square residual
would be reduced to .026.

PHYSICAL MEASURES

Data were also available for the weight and height of 275 girls
obtained on a yearly basis at ages 7 through 16 (Humphreys, 1967).
Using the results obtained every second year starting at age 7, an
attempt was made to fit these two sets of data to the simplex model.



Table 6-5

Mental Ages of Boys at Various Chronological Ages

a. Intercorrelations




10 



11 



12 



13 



14 , 



15 



16. ■ 17 ' 



1.000 

".816 1.000 



.76? 

.70A, 

.726 

.738 

.699 

.60A 



.859 
.787 
.-745 
.810 



1.000 
.85A 
.778 
.786 



Ape 8 


9 


10 • 


8 


.000 






9 


-.029 


.0^1 




> 

10 


.023 • 


-.024 


.012 


11 


•■ .091- 


-.017 


.004- 


12 


" 1 

.093 


-.001 


-.Olfi 


13 


. .053 


, .016 


-.044- 


14 


.010 


-.028 


-.003 


15 


.079 


-.025 


.033 


16 


.097 ■ 


-.015 


i .012 


17' 


,.028' 


-.006 


■ -.050 



1.000 
.864 
.785 

.802 * .806 .7 70 
.736 .775' .780 
b. Residuals (R-R) 
11 12 13 



..l.OOO 
,839 

.778 

' 1 

. 750' 
14 



1.000 \ , r 

.868 1^000 
.778 .848 



15 



1.000 



17 



.012 






.021 


-.003 




.033 


.020 


.010 


.002 


1 -.031 


.013 


.042 


I-.025 


-.027 


.047 


1 .023 


-.bl7 



.004 ! 
.021 -.015 
.02^ .011' ' 



.000 



Table 6-5 (Continued)

c. Parameter Estimates




Age 
' ' 8 


Beta ' . 






9 

10 


• .867 
• , .919 


.193 . . ' . . . 
_.141 ". . " ^ ' ' 




11 


.952- 


.100... • . " ^ 




. 12 


.970 ' • 


•■.056 • v , ■ • ,^ 


* 


13 


.950 






14 


*• .974 


..033 ^ • .. ' " 
.070" . r - , • '• 




. 15 


.966 




16 
17 


.976 • 
. .952 


•05^ ■ . " • r - 

.068 ,- ■ ^ 






Var(e) = .135

Chi-Square = 200.98 with 35 df (p < .001)



The results are reported in Tables 6-6 and 6-7 for weight and height
respectively. In section "a" of each table the intercorrelations are
reported. The residuals are reported in section "b" and the parameter
estimates and chi-square test are reported in section "c" of each table.

For both weight and height the chi-square test leads to a rejection
of the model. The residual matrices, however, show a relatively
good fit for ages 7, 9, 11, and 13 with a relatively much poorer fit
to the correlations involving height or weight at age 15. The
estimated variance of the errors of measurement is zero for both height
and weight, which reflects the high reliability of these physical
measures but is necessarily an underestimate.

The apparently systematic nature of the residuals for the two sets
of physical measures suggests that the simplex model is not adequate
for these data. In both cases, the fit is exceptionally good for pairs
of measures that are close in time but it becomes less and less
adequate for pairs of measures that are further separated in time. For
weight (Table 6-6) the average residuals for correlations are .098,
.030, .013, and .000 for measurements separated by 3, 2, 1, and
0 intervening measures respectively. A similar, though less
pronounced, trend can be seen for height (Table 6-7). This pattern of
residuals stands in contrast to those that were observed above for the
aptitude and achievement data. For example, the averages of the
absolute values of the residuals for the vocabulary data (Table 6-4)
were .021, .015, .023, and .014 for measures with 3, 2, 1, and 0
intervening measures respectively.
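The averages by separation quoted above can be obtained by grouping the off-diagonal residuals according to the number of intervening measures. The sketch below is illustrative; the residuals are those read from section "b" of Table 6-6 (weight), the function name is hypothetical, and absolute values are averaged, which makes no difference for these particular entries because all are non-negative as printed.

```python
# Off-diagonal residuals for weight as read from Table 6-6, keyed by the pair of ages.
weight_residuals = {
    (7, 9): 0.000, (7, 11): 0.013, (7, 13): 0.021, (7, 15): 0.098,
    (9, 11): 0.000, (9, 13): 0.006, (9, 15): 0.039,
    (11, 13): 0.000, (11, 15): 0.020,
    (13, 15): 0.000,
}
ages = [7, 9, 11, 13, 15]


def mean_abs_residual_by_separation(residuals, occasions):
    """Average absolute residual, grouped by the number of intervening occasions."""
    position = {occasion: i for i, occasion in enumerate(occasions)}
    groups = {}
    for (first, second), value in residuals.items():
        intervening = abs(position[second] - position[first]) - 1
        groups.setdefault(intervening, []).append(abs(value))
    return {k: sum(values) / len(values) for k, values in sorted(groups.items())}


if __name__ == "__main__":
    for intervening, mean in mean_abs_residual_by_separation(weight_residuals, ages).items():
        print(f"{intervening} intervening measures: mean residual = {mean:.3f}")
```

A steady rise of the average with separation, as seen here, is the systematic pattern that the text takes as evidence against the one step simplex for the physical measures.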

DISCUSSION

The above examples illustrate several points: (1) the simplex
model appears to provide a reasonably good fit to at least some sets
of academic aptitude and achievement data, (2) where the data do not
fit the model very well, elements of the residual matrix may identify
particular problem areas, and (3) for the physical measures the pattern
of the residuals suggests a general inadequacy of the one step model of
the simplex. When the fit is judged to be adequate, the simplex
model provides a powerful tool for estimating characteristics of the
unobserved error free measures as well as growth statistics of interest.



                              Table 6-6

         Weight of 275 Girls at Various Chronological Ages

                        a. Intercorrelations

    Age        7       9      11      13      15
     7      1.000
     9       .880   1.000
    11       .810    .906   1.000
    13       .755    .840    .921   1.000
    15       .744    .773    .790    .880   1.000

                        b. Residuals (R - R̂)

    Age        7       9      11      13      15
     7       .000
     9       .000    .000
    11       .013    .000    .000
    13       .021    .006    .000    .000
    15       .098    .039    .020    .000    .000

                       c. Parameter Estimates

    Age      Beta    Var(u)
     7        --     1.000
     9       .880     .225
    11       .906     .179
    13       .921     .151
    15       .880     .225

Var(e) = .000
Chi-Square = 40.92 with 5 df (p < .001)



                              Table 6-7

     Standing Height of 275 Girls at Various Chronological Ages

                        a. Intercorrelations

    Age        7       9      11      13      15
     7      1.000
     9       .980   1.000
    11       .920    .954   1.000
    13       .887    .909    .923   1.000
    15       .836    .844    .790    .901   1.000

                        b. Residuals (R - R̂)

    Age        7       9      11      13      15
     7       .000
     9       .000    .000
    11      -.015    .000    .000
    13       .024    .028    .000    .000
    15       .058    .051   -.042    .000    .000

                       c. Parameter Estimates

    Age      Beta    Var(u)
     7        --     1.000
     9       .980     .039
    11       .954     .090
    13       .923     .148
    15       .902     .188

Var(e) = .000
Chi-Square = 122.34 with 5 df (p < .001)



REFERENCES

Anderson, J. E. The limitations of infant and preschool tests in the
     measurement of intelligence. Journal of Psychology, 1939, 8,
     351-379.

Bracht, G. H. & Hopkins, K. D. Stability of educational achievement.
     In Bracht, G. H., Hopkins, K. D. & Stanley, J. C. (eds.)
     Perspectives in Educational Measurement, Englewood Cliffs,
     N. J.: Prentice-Hall, 1972.

Corballis, M. C. Practice and the simplex. Psychological Review,
     1965, 72, 399-406.

Cronbach, L. J. & Furby, L. How we should measure "change" - or
     should we? Psychological Bulletin, 1970, 74, 68-80.

Guttman, L. A new approach to factor analysis: the radex. In
     Lazarsfeld, P. F. (ed.) Mathematical Thinking in the Social
     Sciences, Glencoe, Illinois: Free Press, 1954.

Humphreys, L. G. Investigations of the simplex. Psychometrika, 1960,
     25, 313-323.

Humphreys, L. G. The fleeting nature of college academic success.
     Journal of Educational Psychology, 1968, 59, 375-380.

Humphreys, L. G. Problems in personnel research. In A. L. Fortuna
     (ed.) Personnel Research and Systems Development, The Personnel
     Research Laboratory, United States Air Force, Lackland Air
     Force Base, Texas, 1967, pp. 67-75.

Humphreys, L. G. The misleading aptitude-achievement distinction.
     Proceedings for the February 1973 Invitational Conference on
     the Aptitude-Achievement Distinction, Monterey, California:
     CTB/McGraw-Hill, 1973.

Jones, M. B. Practice as a process of simplification. Psychological
     Review, 1962, 69, 145-162.

Joreskog, K. G. Estimation and testing of simplex models. The British
     Journal of Mathematical and Statistical Psychology, 1970, 23,
     121-145.

Joreskog, K. G., Gruvaeus, G. T. & van Thillo, M. ACOVS: A general
     computer program for the analysis of covariance structures.
     RB-70-15, Princeton, N. J.: Educational Testing Service, 1970.

Joreskog, K. G. & van Thillo, M. LISREL: A general computer program
     for estimating a linear structural equation system involving
     multiple indicators of unmeasured variables. RB-72-56,
     Princeton, N. J.: Educational Testing Service, 1972.

Linn, R. L. & Werts, C. E. Measurement error in regression. In
     Walberg, H. J. (ed.) Behavioral Data Analysis, in preparation.

Tatsuoka, K. K. Vector-geometric and Hilbert-space reformulations
     of classical test theory. Doctoral dissertation, University
     of Illinois, 1975.

Werts, C. E., Joreskog, K. G. & Linn, R. L. A multitrait-multimethod
     model for studying growth. Educational and Psychological
     Measurement, 1972, 32, 655-678.

Werts, C. E., Linn, R. L. & Joreskog, K. G. A simplex model for
     analyzing academic growth. Educational and Psychological
     Measurement, in press, a.

Werts, C. E., Linn, R. L. & Joreskog, K. G. Reliability of college
     grades from longitudinal data. Educational and Psychological
     Measurement, in press, b.



CHAPTER 7

CONSTANCY OF CONSTRUCT VALIDITY OVER TIME

Whenever test scores are compared over time the extent to which they
are measures of a single common dimension is of concern. This is
obviously true when the level of the test is changed and is a
prerequisite for vertical equating. Hence, the concerns of this section
are closely tied to those that are discussed in the chapter of this
report on vertical equating. Even where the same form of a test is used
at all times, however, it is possible that different traits are measured
by the test at different points in time. An example of such a test might
be one that measures problem solving skill at one age and memory or
computational accuracy at a later age.

The problem of deciding what is measured by an instrument is
basically a problem of construct validity. An important issue for
longitudinal studies is the extent to which measures get at the same
underlying constructs in a constant fashion over time. If this
formulation is accurate, then all of the procedures and considerations
involved in the ongoing task of construct validation would apply to the
concerns of longitudinal measures of change. Thus, the variety of
correlational, experimental, and logical procedures discussed by
Cronbach (1971) are relevant when attempts are made to measure the same
trait at two or more points in time. But the problem is complicated by
the addition of the time dimension.

PATTERN OF INTERCORRELATIONS

When plotting trends or calculating change scores it is typically
assumed that the same thing is being measured at each point in time.
From the observation that scores change from one test administration to
the next, however, it is not clear whether the people have changed along
a given dimension or what is measured by the test has changed.

"If the correlation between pretest and posttest is
reasonably high, we are inclined to ascribe change
scores to changes in the individuals. But if the
correlation is low, or if the pattern of correlations
with other variables is different on the two occasions,
we may suspect that the test does not measure the same
thing on the two occasions. Once it is allowed that
the pretest and posttest measure different things, it
becomes embarrassing to talk about change (Bereiter,
1963, p. 11)."

Bereiter's comments suggest that the pattern of correlations of
the focus variable with other variables is highly relevant as evidence
that the measures are getting at the same thing. Although this
contention is closely related to the approach that is discussed below,
it must be acknowledged at the outset that even the existence of
identical correlations of the focus variable with a host of other
variables would not guarantee that the same thing is being measured. At
best, the similarity of the pattern of correlations can improve the
plausibility of the claim that the same thing is being measured by
making alternative explanations seem less likely. The logical difficulty
of concluding that similar correlations imply measurement of the same
dimension is easily ignored.

Suppose, for example, that at time 1 measure $X_1$ correlates .35,
.15 and .18 with measures $X_2$, $X_3$ and $X_4$ respectively. At time 2
the correlations of $X_1$ with $X_2$, $X_3$ and $X_4$ are .51, .44 and .49
respectively. These results might lead to a suspicion that measure $X_1$
was measuring somewhat different things, but that is not necessarily the
case. In fact, both sets of correlations were derived from the same
model with two latent traits. At both points in time it was assumed that
each $X_{jt}$ was a linear function of two latent traits, $Z_{1t}$ and
$Z_{2t}$, and an uncorrelated error of measurement, $e_{jt}$, where j
indexes the measures and t indexes the time of measurement.

More formally, the model that was used to derive the correlations
at a particular point in time can be expressed

$$X = \mu + BZ + e \qquad (1)$$

where X is a column vector of observations on the p observed variables,
$\mu$ is a column vector of p means, Z is a column vector of scores on the
k latent traits, B is a p by k matrix of weights, and e is a column
vector of errors of measurement on the p measures. It is assumed that
the elements in e are mutually uncorrelated and uncorrelated with the
latent traits. The above model is, of course, simply a factor model
except that the errors of measurement would normally be replaced by
specific factors.

With the above model the variance-covariance matrix among the
observed variables is

$$\Sigma = B \Gamma B' + \Theta^2 \qquad (2)$$

where $\Gamma$ is the variance-covariance matrix among the latent traits
and $\Theta^2$ is a diagonal p x p matrix with the error variances in the
diagonal.



Returning to the example of correlations of $X_1$ with $X_2$, $X_3$ and
$X_4$ at time 1 and time 2, the correlations at both points in time were
generated with the same B and $\Theta^2$ matrices. In both cases B was

$$B = \begin{bmatrix} .7 & 0 \\ .6 & .4 \\ 0 & .6 \\ 0 & .8 \end{bmatrix}$$

and all the error variances were assumed to equal 1.0. At both points
in time the variance of $Z_1$ was also assumed to equal 1.0. Thus, at
both points in time

$$X_t = \mu + B Z_t + e_t$$

where t refers to time. That is, precisely the same thing is being
measured with the same degree of accuracy. Only the variance of $Z_2$ and
the covariance of $Z_1$ and $Z_2$ were changed from time 1 to time 2. All
observed measures remained the same linear function of two latent traits
plus an uncorrelated error of measurement with the same variance and



while

^12 = ^11 = \

Without belaboring this admittedly artificial example further, the
main point is simply the one stated originally. Namely, the similarity
of the pattern of correlations of a measure with a variety of other
measures at two points in time does not imply whether the same or
different things are being measured.
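The example can be made concrete with equation (2). The sketch below is illustrative only. The weight matrix and the unit error variances are those stated in the text, but the latent-trait covariance matrices are not recoverable from it: `gamma_time1` is an assumption chosen because it reproduces the reported time 1 correlations of .35, .15 and .18, and `gamma_time2` is a purely hypothetical second occasion that is not intended to reproduce the reported time 2 values.

```python
import math

# Weight matrix B (4 measures by 2 latent traits) and error variances from the text.
B = [[0.7, 0.0],
     [0.6, 0.4],
     [0.0, 0.6],
     [0.0, 0.8]]
error_var = [1.0, 1.0, 1.0, 1.0]

# Latent-trait covariance matrices (assumptions, see the lead-in above).
gamma_time1 = [[1.0, 0.5], [0.5, 1.0]]
gamma_time2 = [[1.0, 1.5], [1.5, 4.0]]


def implied_correlations(B, gamma, error_var):
    """Correlation matrix implied by Sigma = B Gamma B' + Theta^2."""
    p, k = len(B), len(gamma)
    cov = [[sum(B[i][a] * gamma[a][b] * B[j][b] for a in range(k) for b in range(k))
            + (error_var[i] if i == j else 0.0)
            for j in range(p)]
           for i in range(p)]
    return [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
            for i in range(p)]


if __name__ == "__main__":
    for label, gamma in [("time 1", gamma_time1), ("time 2", gamma_time2)]:
        r = implied_correlations(B, gamma, error_var)
        print(label, [round(r[0][j], 2) for j in (1, 2, 3)])
```

The weights and error variances are identical on both occasions, yet the correlations of the first measure with the other three differ, because only the variance of the second latent trait and its covariance with the first were changed.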

A similar approach to making inferences about the constancy of what
is being measured by a variable is to compare standardized factor
loadings. If two sets of standardized factor loadings are equal or
proportional it is sometimes inferred that the variables are measuring
the same things at different points in time. Given the above arguments
about intercorrelations, it is hardly surprising that such an inference,
or its converse based on non-proportional standardized loadings, is not
justified (see Werts, Joreskog and Linn, 1972, pp. 673-675).

A better approach to the problem is to compare unstandardized factor
weights. If the same latent trait is being measured then the
unstandardized factor weights should be constant, assuming a linear
factor model. This of course is a strong assumption which may not be
justified. Within the model, however, different weight matrices would
imply that different things are being measured. Unfortunately, the same
B and $\Gamma$ matrices do not necessarily imply the same factors.
Speaking in a slightly different context, McGaw and Joreskog note that
"...there is no mathematical basis for the inference of identity of
common factors across populations, even in the case where common ...
[B and $\Gamma$] can be fitted to all populations. It is clearly
possible ... that identical dispersion matrices could be obtained from
different test batteries ... (1971, p. 165)." The same statement would
apply within our context of the same population measured at two or more
points in time.

Although common B and $\Gamma$ don't conclusively imply the identity of
common factors at different points in time, it is still of value to be
able to reject the proposition that the common factors are the same
when the matrices are different. Furthermore, "...the inference of
identical factors seems reasonable if the ... [B and $\Gamma$] matrices
are the same ... (McGaw and Joreskog, 1971, p. 165)." Even if only the B
matrices are the same, as in the example used above, the same
substantive interpretation seems reasonable, albeit with different
variances and interrelationships among the latent variables.

CONGENERIC MEASURES OVER TIME

A relatively simple yet conceptually appealing model for measures of the same trait over time is provided by the notion of congeneric measures (Jöreskog, 1968, 1971). Except for errors of measurement, congeneric tests measure the same trait and their true scores are linearly related. As applied to the longitudinal situation, an observation on measure j at time t, $X_{jt}$, would be given by

$$X_{jt} = \mu_{jt} + b_{jt} Z_j + e_{jt}$$

where $\mu_{jt}$ is the mean, $b_{jt}$ is the weight for variable j at time t, $Z_j$ is the latent variable for variable j, and $e_{jt}$ is the error of measurement on variable j at time t. The lack of a t subscript on the $Z$ corresponds to the assumption that measure j measures the same trait at all points in time. As usual, the errors of measurement are assumed to be mutually uncorrelated and uncorrelated with the latent traits.
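
A brief worked consequence of this definition may help. It is a sketch derived from the model equation under the stated assumption of uncorrelated errors, not a result quoted from the report. For a single measure j observed at times 1 and 2, the model implies:

```latex
\begin{aligned}
\operatorname{Var}(X_{jt}) &= b_{jt}^{2}\,\operatorname{Var}(Z_j) + \operatorname{Var}(e_{jt}), \qquad t = 1, 2,\\
\operatorname{Cov}(X_{j1}, X_{j2}) &= b_{j1}\, b_{j2}\,\operatorname{Var}(Z_j).
\end{aligned}
```

The covariance between occasions carries no error variance, so it is the one element that depends only on the weights and the latent variance.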



Even with only observations on a single measure, the hypothesis that the measures are congeneric may be tested, assuming multivariate normality, provided observations on four or more occasions are available (Jöreskog, 1968). There would still be advantages to having several sets of measures, however, since this would provide a more powerful test of the model, especially the assumption that the error terms in the model are uncorrelated with all other variables. Although the above approach is attractive with measures available at numerous points in time, by far the most typical situation encountered in longitudinal studies is where the same measures are obtained at only two points in time. Also, for most data sets involving measures of academic achievement, the simplex model discussed in another chapter of this report is apt to provide a better fit. With only two points in time and with only a single measure at each occasion, no test of the model is possible.

With three or more measures available at two points in time, models can be constructed to test whether each measure is congeneric over the two time points. The test would not be specific to this hypothesis alone, however. The model would also involve specifications of the factor structure of the latent trait dispersion matrix, $\Gamma$. Following Jöreskog (1968, 1971) the factor model for $\Gamma$ may be specified

$$\Gamma = \Lambda \Phi \Lambda' + \Psi$$

where $\Lambda$ is a matrix of factor loadings for true scores, $\Phi$ is the variance-covariance matrix among the factors underlying the true scores and $\Psi$ is a diagonal matrix of uniquenesses. With this structure of $\Gamma$ the full model may be expressed

$$\Sigma = B(\Lambda \Phi \Lambda' + \Psi)B' + \Theta^2$$

which may be analyzed following procedures described in Jöreskog (1970).

To illustrate this approach two small examples each involving three 
tests with scores at two points in time were selected. 

Example 1: For the first example, data on two arithmetic tests and an attitudinal measure were used. These measures were used for 75 children before and after an instructional program in arithmetic. The variance-covariance matrix for the 6 variables is reported in Table 1.

The model specified that a given measure at two points in time is congeneric and that there is one common and three specific factors underlying the three true scores. Thus, with the tests ordered tests 1, 2, and 3 at time 1 then tests 1, 2, and 3 at time 2, as they are for the variance-covariance matrix in Table 1, the model specifies that the $B$ matrix will have four zeros and two values to be estimated in each column. The pattern is

$$B = \begin{bmatrix} * & 0 & 0 \\ 0 & * & 0 \\ 0 & 0 & * \\ * & 0 & 0 \\ 0 & * & 0 \\ 0 & 0 & * \end{bmatrix}$$

where the asterisks are the values to be estimated.
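
The covariance structure implied by this specification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every numeric value below is hypothetical (none is taken from the tables of this report), the helper names are invented for the example, the diagonal of $\Psi$ is treated as specific-factor variances, and the diagonal of $\Theta^2$ as error variances.

```python
# Sketch of Sigma = B (Lam Phi Lam' + Psi) B' + Theta2 for three tests
# measured at two time points. All numbers are hypothetical.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def implied_covariance(b_time1, b_time2, lam, phi, psi, theta2):
    """Return the 6 x 6 covariance matrix implied by the congeneric model."""
    n = len(lam)
    # B stacks two diagonal blocks: rows 0..n-1 are time 1, rows n..2n-1 are time 2,
    # which reproduces the pattern of asterisks and zeros shown above.
    B = diag(b_time1) + diag(b_time2)
    Lam = [[x] for x in lam]                        # n x 1 loading matrix
    common = matmul(matmul(Lam, [[phi]]), transpose(Lam))
    gamma = [[common[i][j] + (psi[i] if i == j else 0.0) for j in range(n)]
             for i in range(n)]                     # true-score dispersion matrix
    sigma = matmul(matmul(B, gamma), transpose(B))
    for i in range(2 * n):
        sigma[i][i] += theta2[i]                    # add error variances
    return sigma

b1 = [2.0, 1.5, 3.0]            # weights at time 1 (hypothetical)
b2 = [1.8, 1.6, 2.5]            # weights at time 2 (hypothetical)
lam = [1.0, 0.8, 1.2]           # loadings on the single common factor
psi = [0.0, 0.5, 1.0]           # specific-factor variances
theta2 = [4.0, 3.0, 9.0, 4.5, 2.5, 8.0]
sigma = implied_covariance(b1, b2, lam, 1.0, psi, theta2)
```

Under these assumptions the covariance of a test with itself across occasions, for example `sigma[0][3]`, is the product of its two weights and its true-score variance, with no error term.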






TABLE 7-1
Variance-Covariance Matrix (Example 1, N = 75)

                     Variable:       1         2         3         1         2         3
   Variable    Time      Time:       1         1         1         2         2         2

1. Arith. 1     1               118.50
2. Arith. 2     1                45.33     46.68
3. Attitude     1               257.46    135.38   2555.80
1. Arith. 1     2                73.66     39.82    239.62     94.00
2. Arith. 2     2                56.99     39.40    149.62     48.29     58.17
3. Attitude     2               238.21    126.62   1166.40    159.15    152.25   1683.00






The $\Lambda$ matrix has three rows and one column with all entries free, $\Phi$ is just a scalar of 1.0, and $\Psi$ is a 3 x 3 diagonal matrix with all diagonal entries free. The maximum likelihood solution for the variance-covariance matrix of example 1 is presented in Table 2.

All three variables have substantial weights on the general factor. For each variable the weights in $B$ are reasonably similar at the two points in time. The attitude measure has an apparently large variance of the errors of measurement but the true score variance is also very large on this variable in comparison to the other two variables. The critical question regarding the above results is the adequacy of the model for the data. This is answered in two ways: by a chi-square test of fit and by an inspection of the matrix of residuals. The chi-square for these data is 5.95 with 3 degrees of freedom, which is not significant at the .10 level.

The matrix of residuals, i.e., the observed variance-covariance matrix minus the variance-covariance matrix estimated by the model, is reported in Table 3. The residuals shown in Table 3 are generally small compared to the corresponding elements in Table 1. The largest residual, not only in absolute magnitude but as a ratio of the corresponding element in Table 1, is for the covariance of variable 1, time 2 with variable 3, time 2. All of the larger residuals involve variable 3, which may not be surprising given that variable 3 is an attitude measure whereas the other two are achievement tests.



Although the above model provides a reasonably satisfactory fit, it is not a very severe test of the hypothesis that each measure measures the same thing at both points in time. A total of 18 parameters (6 in $B$, 6 in $\Theta^2$, 3 in $\Lambda$, and 3 in $\Psi$) were estimated from a total of only 21 distinct elements in the observed variance-covariance matrix. A more severe test would be provided with more measures, more points in time or fewer parameters. One way to reduce the number of parameters is to make the model more restrictive. For example, the variance of the errors of measurement of a given measure might be assumed to be equal at both points in time. This would reduce the number of parameters to be estimated in $\Theta^2$ from 6 to 3 and require a total of 15 rather than 18 parameters to be estimated.
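
The parameter counting in this paragraph can be checked with a short sketch. The function and dictionary names are illustrative only; the counts themselves are those given in the text.

```python
def distinct_elements(n_vars):
    """Number of distinct variances and covariances among n_vars variables."""
    return n_vars * (n_vars + 1) // 2

def degrees_of_freedom(n_vars, free_params):
    """Distinct observed elements minus the number of free parameters."""
    return distinct_elements(n_vars) - sum(free_params.values())

# Free parameters per matrix, as counted in the text.
unrestricted = {"B": 6, "Theta2": 6, "Lambda": 3, "Psi": 3}
# Equal error variances at the two occasions: 3 rather than 6 entries in Theta2.
equal_errors = dict(unrestricted, Theta2=3)

df_unrestricted = degrees_of_freedom(6, unrestricted)   # 21 - 18
df_equal_errors = degrees_of_freedom(6, equal_errors)   # 21 - 15
```

Each restriction that removes a free parameter adds one degree of freedom to the test of fit, which is the sense in which a more restrictive model gives a more severe test.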

With the equal error variance restraint added, the parameter estimates reported in Table 4 were obtained for the variance-covariance matrix



TABLE 7-2
Maximum Likelihood Solution (Example 1)

                     B Matrix               Diagonal Entries
  i     t        1        2        3        in $\Theta^2$

  1     1      3.30      .0*      .0            5.50
  2     1       .0      1.51      .0            3.82
  3     1       .0       .0      6.32          34.71
  1     2      2.76      .0       .0            5.69
  2     2       .0      1.86      .0            3.12
  3     2       .0       .0      5.46          26.00

  i     $\Lambda$ Matrix     Entries in $\Psi$

  1         2.85                 .00
  2         3.21                1.82
  3         4.38                3.82

*Fixed by hypothesis



TABLE 7-3
Residual Matrix (Example 1)

              Variable:       1         2         3         1         2         3
 Variable  Time   Time:       1         1         1         2         2         2

    1       1                .00
    2       1              -1.16       .00
    3       1              -2.70     -1.69       .01
    1       2               -.07      -.99     22.29       .00
    2       2                .15       .00    -18.79       .56       .00
    3       2          [illegible]    8.34      -.01    -28.45      6.88       .00



TABLE 7-4
Maximum Likelihood Solution With Constant Error Variances for Each Measure (Example 1)

                     B Matrix               Entries
  i     t        1        2        3        in $\Theta^2$

  1     1      2.54      .0*      .0           5.59**
  2     1       .0      1.97      .0           3.55
  3     1       .0       .0      5.68         29.59
  1     2      2.16      .0       .0           5.59
  2     2       .0      2.31      .0           3.55
  3     2       .0       .0      4.15         29.59

  i     $\Lambda$ Matrix     Entries in $\Psi$

  1         3.68                 .00
  2         2.57                1.42
  3         5.20                4.82

* Fixed by hypothesis
** The pairs of elements 1 and 4, 2 and 5, and 3 and 6 are restrained to be equal.



in Table 1. The solution shown in Table 4 yields a chi-square of 8.07 with 6 degrees of freedom, which is not significant at the .20 level. While the resulting residuals are slightly larger than those shown in Table 3, the model, even with the restriction of equal error variance for a given measure at the two points in time, appears reasonable.

A still more restrictive model for the above data is provided by requiring that not only the error variances but also the entries in $B$ be the same for a given measure at the two points in time. This is equivalent to the hypothesis that each measure at time 2 is parallel to the corresponding measure at time 1 except for a possible additive constant from time 1 to time 2. This is a very restrictive model for longitudinal measures. It says, in effect, that the only two possible differences between time 2 and time 1 measures are different means and different errors of measurement. The underlying true scores are identical within an additive constant and the errors of measurement are uncorrelated and have equal variances. With these additional restrictions the estimates reported in Table 5 were obtained.

The chi-square for the rather highly restricted solution shown in Table 5 is 16.33 which, with 9 degrees of freedom (21 separate elements in the variance-covariance matrix minus 12 parameters to be estimated), has an associated p value of .06. Although not significant, this increase in the chi-square suggests that the model may be too restrictive. A test of the additional restriction of equal regression weights is provided by the difference in the chi-squares associated with the solutions in Tables 4 and 5. This difference is 8.26 and with 3 degrees of freedom is significant at the .05 level. This suggests that the restriction of equal entries in $B$ is not reasonable.
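
The tail probabilities quoted in this paragraph can be approximated without statistical tables. The sketch below is an illustration, not the procedure used in the report: the function name `chi2_sf` is an assumption, and it relies on the standard recurrence for the regularized upper incomplete gamma function, which is exact for integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                  # Q(1, y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))       # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Restricted model (Table 5): chi-square 16.33 on 9 degrees of freedom.
p_restricted = chi2_sf(16.33, 9)

# Difference test for equal entries in B: 8.26 on 9 - 6 = 3 degrees of freedom.
p_difference = chi2_sf(8.26, 3)
```

The difference test is appropriate here because the model with equal weights is nested within the model that only equates error variances, so the difference in chi-squares is itself referred to a chi-square distribution with the difference in degrees of freedom.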

Example 2: As a second example, data available on three arithmetic subtests (subtraction, multiplication, and division) at two points in time were used. The variance-covariance matrix for a sample of 47 fourth grade students on these six variables is shown in Table 6. The maximum likelihood solution for the model specifying congeneric measures over time and one factor underlying the true scores is shown in Table 7. The chi-square test of the model is 11.52 which with 3 degrees of freedom is significant at the .01 level. Thus, in contrast to the results for example 1, the least restrictive model can be confidently rejected for the data in example 2.

Part of the problem with the model may be suggested by the entries in the residual matrix which is shown in Table 8. Three of the four largest residuals all involve the multiplication test. It may be that the hypothesis that a test is congeneric over time is least reasonable for the multiplication test.




LESS RESTRICTIVE MODELS 



As was previously indicated, the hypothesis of congeneric measures over time may be much too restrictive in most longitudinal situations. The notion of growth does not normally involve the strong assumption that the true score at time t is merely a linear function of the true score at





TABLE 7-5
Maximum Likelihood Solution with Constant Error Variance and Regression of Observed on True Score for Each Measure (Example 1)

                     B Matrix               Entries
  i     t        1        2        3        in $\Theta^2$

  1     1      2.67*     .0**     .0           5.69*
  2     1       .0      2.47      .0           3.61
  3     1       .0       .0      5.79         30.87
  1     2      2.67      .0       .0           5.69
  2     2       .0      2.47      .0           3.61
  3     2       .0       .0      5.79         30.87

  i     $\Lambda$ Matrix     Entries in $\Psi$

  1         3.22                 .00
  2         2.24                1.20
  3         4.46                3.85

* Pairs of entries for a given measure are restrained to be equal
** Fixed by hypothesis



TABLE 7-6
Variance-Covariance Matrix (Example 2, N = 47)

                          Variable:       1         2         3         1         2         3
   Variable         Time      Time:       1         1         1         2         2         2

1. Subtraction       1                  2.35
2. Multiplication    1                  1.24      2.54
3. Division          1            [illegible]      .58      2.47
1. Subtraction       2                   .70       .39       .10      1.56
2. Multiplication    2                   .96       .41       .71       .93      2.52
3. Division          2                  1.52       .95      1.02      1.10      1.83      3.37



TABLE 7-7
Maximum Likelihood Solution (Example 2)

                     B Matrix               Entries
  i     t        1        2        3        in $\Theta^2$

  1     1       .92      .0*      .0           1.17
  2     1       .0       .59      .0           1.46
  3     1       .0       .0       .46          1.45
  1     2       .67      .0       .0           1.02
  2     2       .0      1.01      .0           1.12
  3     2       .0       .0      1.31           .70

  i     $\Lambda$ Matrix     Entries in $\Psi$

  1         1.07                 .00
  2      [illegible]             .00
  3         1.20                 .50

*Fixed by hypothesis

time t - 1. Rather, we would normally like to assume that the rank order of individuals along a given dimension may change over time. Once the rank order on the underlying dimension is allowed to change, however, there is a difficulty in establishing whether it is the trait being measured or the people that are changing. Thus, the fundamental problem with which we started this chapter still remains.

If a complete model can be specified it may sometimes be tested within the context of the general procedures for the analysis of covariance structures (Jöreskog, 1970). In most instances, the theory is apt to be lacking to make this more than an approach to testing the reasonableness of a variety of possibilities. With three or more occasions and several measures the procedures described by Jöreskog (1969) for factoring a multitest-multioccasion matrix should be of value. When restricted to two points in time, as is typically the case, however, strong assumptions about the causal structure of the unmeasured variables are apt to be needed.

An approach to the problem involving multiple measures of a trait at time 1 and again at time 2 as well as multiple measures of a second variable that is thought to be a determinant of growth is discussed by Werts, et al. (1972). While potentially useful, their approach makes heavy practical demands for a closed model with all intercorrelated determinants of final status on the trait of interest included. It also requires multiple measures (at least three) of each trait.

Several attempts were made to illustrate the approach described in Werts, et al. using Project TALENT results reported by Shaycoft (1967). We were not successful, however, in finding examples for which the fit was good enough to provide useful illustrations of the approach. This failure is probably due, in large part, to the artificial nature of the examples that were attempted. The Project TALENT data collection was not designed with such an analytical model in mind and the needed multimethod approach to the measurement of each trait was not used. As a result the identification of "methods" factors and of a causal model for analysis were too crude to be successful.

Conclusions

The problem of deciding if it is the people or the nature of the dimension that is changing is basically a problem of construct validity. As such, it is an unending process for which theory, logical analysis and a variety of empirical procedures are relevant. Assuming linearity, the procedures for the analysis of covariance structures (Jöreskog, 1970) provide a potentially powerful analytical tool in this effort. But there are two major obstacles to the application of this approach. These are the lack of theory to guide the testing of specific hypotheses and the requirement of multiple measures for all but the simplest of hypotheses.



REFERENCES

Bereiter, C. Some persisting dilemmas in the measurement of change. (In C. W. Harris, ed.), Problems in Measuring Change. Madison, Wisconsin: University of Wisconsin Press, 1963, 3-20.

Cronbach, L. J. Test validation. (In R. L. Thorndike, ed.), Educational Measurement, second edition. Washington, D.C.: American Council on Education, 1971, 443-507.

Jöreskog, K. G. Statistical models for congeneric test scores. Proceedings of the American Psychological Association, 76th Annual Convention, 1968, 213-214.

Jöreskog, K. G. Factoring the multitest-multioccasion matrix. Princeton, New Jersey: Educational Testing Service, Research Bulletin 69-62, 1969.

Jöreskog, K. G. A general method for the analysis of covariance structures. Biometrika, 1970, 57, 239-251.

Jöreskog, K. G. Statistical analysis of sets of congeneric tests. Psychometrika, 1971, 36, 109-133.

Jöreskog, K. G., Gruvaeus, G. T., & van Thillo, M. ACOVS: A general computer program for the analysis of covariance structures. Princeton, New Jersey: Educational Testing Service, Research Bulletin 70-15, 1970.

McGaw, B. & Jöreskog, K. G. Factorial invariance of ability measures in groups differing in intelligence and socioeconomic status. British Journal of Mathematical and Statistical Psychology, 1971, 24, 154-168.

Shaycoft, M. F. The High School Years: Growth in Cognitive Skills. Unpublished Technical Report. Pittsburgh: The American Institutes for Research, 1967.

Werts, C. E., Jöreskog, K. G. & Linn, R. L. A multitrait-multimethod model for studying growth. Educational and Psychological Measurement, 1972, 32, 655-678.



Chapter 8 



TIME-SERIES ANALYSIS APPLIED TO LONGITUDINAL STUDIES



INTRODUCTION 



Time-series analysis refers to the body of knowledge and techniques that deals with the fitting of stochastic models to a series of observations made at successive, equally spaced time points. It thus differs from techniques for fitting deterministic models such as polynomial and multiple regression equations. Developed primarily in the context of industrial engineering, economics, and business management, its primary purpose heretofore has been forecast and control (Box and Tiao, 1965; Box and Jenkins, 1970; Nelson, 1973).

The application of time-series analysis to behavioral and social sciences in general, and to educational and psychological research in particular, has been pioneered by Campbell (1969) and Glass, Willson and Gottman (1975), among others. The main objective of these works has been the application of the technique to "interrupted time-series experiments," i.e., studies in which series of observations both before and after the introduction of some experimental intervention are involved, and whose aim is to examine the nature and significance of the effects of the intervention, if any.

The purposes of this chapter are threefold: first, to present a more elementary exposition of the methodology of time-series analysis than is available in the literature to date; second, to point out that, as currently used, the method does not take into account the longitudinal nature of the data, but rather treats them as sequential cross-sectional data; third, to suggest some modifications to make the technique specifically applicable to genuine longitudinal studies.

THE BASIC MODELS 

Within the rubric of linear models, the most general stochastic model for discrete time-series observations is one which postulates that the observation $z_t$ at time t is expressible as a linear combination of an overall "level" parameter $L$ and random disturbances (or white noise) at time t and all prior time points, $a_t, a_{t-1}, a_{t-2}, \ldots$; thus

[1]   $z_t = L + a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots$

which is called the general discrete linear stochastic process model, or the "linear filter" model for short. In order to achieve anything resembling tractability, we must assume that the random disturbances $a_t$ are identically and independently distributed random variables with mean 0 and variance $\sigma_a^2$. For inferential purposes we further assume that the common distribution is normal; i.e.,

$$a_t \sim \mathrm{IND}(0, \sigma_a^2).$$

At first glance it may seem that for any stochastic process expressible by Eq. [1], it should follow that

$$\begin{aligned}
E(z_t) &= L + E(a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots) \\
&= L + E(a_t) + \psi_1 E(a_{t-1}) + \psi_2 E(a_{t-2}) + \cdots \\
&= L + 0 + 0 + 0 + \cdots \\
&= L.
\end{aligned}$$



This is fallacious, however, in that the transition from the first to the second step is not valid unless the infinite series $a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots$ is convergent. The necessary and sufficient condition for this to be the case is that the coefficient series $\sum_{i=0}^{\infty} \psi_i$ (where $\psi_0 = 1$) itself be convergent. If, and only if, this is true, we can assert that $E(z_t) = L$ for all t. Thus, as a first principle, we have:

[2]   $E(z_t) = L$, for all t, iff $\sum_{i=0}^{\infty} \psi_i = K < \infty$.

When this condition holds, process [1] is said to be stationary through the second moments, for as we shall immediately see, the condition also implies that the variance $\mathrm{Var}(z_t)$ and covariances between staggered $z$'s are independent of t. Together with the normality assumption for the distribution of $a_t$, stationarity through the second moments assures complete stationarity, i.e., that the probability distribution of $z_t$ is invariant with respect to t. Intuitively, a stationary process is one in which the successive observations, although "meandering" in time, always center around a fixed mean, $E(z_t) = L$.

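Let us now verify the above assertion that the condition stipulated in [2] is sufficient also to guarantee that $\mathrm{Var}(z_t)$ exists and is independent of t.

$$\begin{aligned}
\mathrm{Var}(z_t) &= E(z_t - L)^2 = E(a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots)^2 \\
&= E(a_t^2 + \psi_1^2 a_{t-1}^2 + \psi_2^2 a_{t-2}^2 + \cdots) + 2E(\psi_1 a_t a_{t-1} + \psi_2 a_t a_{t-2} + \cdots) \\
&= \sigma_a^2 \sum_{i=0}^{\infty} \psi_i^2,
\end{aligned}$$

since $E(a_t a_{t'}) = 0$ for $t \neq t'$ because the $a_t$ are assumed to be independently distributed. Obviously, the convergence of $\sum_{i=0}^{\infty} \psi_i$ assures $\sum_{i=0}^{\infty} \psi_i^2$ also to be convergent. We have thus shown that

[3]   $\mathrm{Var}(z_t) = \sigma_a^2 \sum_{i=0}^{\infty} \psi_i^2$ for all t, iff $\sum_{i=0}^{\infty} \psi_i < \infty$.

Similarly, it can be shown that

[4]   $\mathrm{Cov}(z_t, z_{t+j}) = \sigma_a^2 \sum_{i=0}^{\infty} \psi_i \psi_{i+j}$, iff $\sum_{i=0}^{\infty} \psi_i < \infty$.

In the literature of time-series analysis, $\mathrm{Var}(z_t)$ for stationary processes is denoted by $\gamma_0$ and $\mathrm{Cov}(z_t, z_{t+j})$ by $\gamma_j$, the latter being called the autocovariance of lag j.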

A simple example of a stationary process is one for which the coefficients $\psi_i$ in [1] are given by

$$\psi_i = \phi^i, \quad \text{where } |\phi| < 1.$$

In this case,

$$\sum_{i=0}^{\infty} \psi_i^2 = 1 + \phi^2 + \phi^4 + \cdots = \frac{1}{1 - \phi^2}$$

and

$$\sum_{i=0}^{\infty} \psi_i \psi_{i+j} = \frac{\phi^j}{1 - \phi^2}.$$

Hence, Eqs. [3] and [4] specialize to

[3*]   $\gamma_0 = \sigma_a^2 / (1 - \phi^2)$

and

[4*]   $\gamma_j = \sigma_a^2 \phi^j / (1 - \phi^2)$.
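
As a numerical check on Eqs. [3] and [4] and their special cases [3*] and [4*], the sums can be evaluated directly for a truncated set of weights. The sketch below is illustrative: the function name, the truncation at 200 weights, and the value 0.6 are assumptions chosen for the example.

```python
def autocovariance(psi, lag, sigma2=1.0):
    """Eq. [4] for a finite list of weights psi[0], psi[1], ... with psi[0] = 1.

    With lag = 0 this reduces to the variance of Eq. [3].
    """
    return sigma2 * sum(psi[i] * psi[i + lag] for i in range(len(psi) - lag))

phi = 0.6
psi = [phi ** i for i in range(200)]     # weights psi_i = phi**i, truncated

gamma0 = autocovariance(psi, 0)          # compare with 1 / (1 - phi**2)
gamma3 = autocovariance(psi, 3)          # compare with phi**3 / (1 - phi**2)

closed_gamma0 = 1.0 / (1.0 - phi ** 2)
closed_gamma3 = phi ** 3 / (1.0 - phi ** 2)
```

Because the weights decay geometrically, the truncated sums agree with the closed forms to well within floating-point tolerance.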

3 ^ 



Moving-Average Processes.

An even simpler way in which Eq. [1] can represent a stationary process is when the coefficients $\psi_i$ are all zero for $i > q$. The series $a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots$ then terminates with the term $\psi_q a_{t-q}$, and the coefficient series $\sum_{i=0}^{\infty} \psi_i = \sum_{i=0}^{q} \psi_i$ necessarily converges. The resulting process

[5]   $z_t = L + a_t + \psi_1 a_{t-1} + \psi_2 a_{t-2} + \cdots + \psi_q a_{t-q}$

is called a moving-average process of order q, abbreviated MA(q).¹



¹The phrase "moving-average" does not mean that the average of $z_t$ "moves" or varies with t; otherwise the process would be non-stationary. It simply means that $z_t - L$ is a weighted composite of the set of disturbances through q time points back, which of course moves with t. For example, with q = 2, $z_t - L$ is a weighted composite of $a_t$, $a_{t-1}$ and $a_{t-2}$; $z_{t+1} - L$ is a weighted composite of $a_{t+1}$, $a_t$ and $a_{t-1}$. It is the set of $a$'s of which $z_t$ is a weighted composite that moves with t. Note also that the weighted composite, $a_t + \psi_1 a_{t-1} + \cdots + \psi_q a_{t-q}$, is not a weighted average, since the coefficients $1, \psi_1, \psi_2, \ldots, \psi_q$ do not, in general, sum to unity [as Box and Jenkins (1970, p. 10) point out]. For historical reasons, the phrase "moving-average" is retained even though it is, strictly speaking, a misnomer.



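For purely historical reasons again, the coefficients $\psi_i$ ($i > 0$) are replaced by $-\theta_i$, so the conventional equation for an MA(q) process is

[6]   $z_t = L + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$.

Thus, the simplest case (which turns out to be adequate for many situations) is written as

[7]   $z_t = L + a_t - \theta_1 a_{t-1}$.

It follows from Eqs. [3] and [4] that

[8]   $\gamma_0 = \sigma_a^2 (1 + \theta_1^2)$,

[9]   $\gamma_1 = -\theta_1 \sigma_a^2$,

while

[10]   $\gamma_j = 0$ for $j > 1$.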



In addition to the variance and autocovariances of various lags, another important parameter for stationary time-series models is the autocorrelation of lag j. Its importance lies in the fact that its sample counterpart is one of the main statistics used for identifying the appropriate model for a given series of observed data, as we shall see later. The autocorrelation of lag j, denoted by $\rho_j$, is computed in the usual way, as

$$\rho_j = \frac{\mathrm{Cov}(z_t, z_{t+j})}{\sqrt{\mathrm{Var}(z_t)}\,\sqrt{\mathrm{Var}(z_{t+j})}}.$$

But, since $\mathrm{Cov}(z_t, z_{t+j}) = \gamma_j$ and $\mathrm{Var}(z_t) = \mathrm{Var}(z_{t+j}) = \gamma_0$, $\rho_j$ may be expressed as

[11]   $\rho_j = \gamma_j / \gamma_0$.

Thus, for MA(1) we have

[12]   $\rho_1 = -\theta_1 / (1 + \theta_1^2)$ and $\rho_j = 0$ for $j > 1$.

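In general, for MA(q) the autocorrelations of lags less than or equal to q are non-zero, and those of lags greater than q are zero. For instance, for MA(2) we have from Eqs. [3] and [4]

$$\begin{aligned}
\gamma_0 &= \sigma_a^2 (1 + \theta_1^2 + \theta_2^2) \\
\gamma_1 &= \sigma_a^2 (-\theta_1 + \theta_1 \theta_2) \\
\gamma_2 &= -\theta_2 \sigma_a^2
\end{aligned}$$

Hence,

[13]   $\rho_1 = (-\theta_1 + \theta_1 \theta_2) / (1 + \theta_1^2 + \theta_2^2)$

       $\rho_2 = -\theta_2 / (1 + \theta_1^2 + \theta_2^2)$

       $\rho_j = 0$ for $j > 2$.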
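
The MA(1) and MA(2) results in Eqs. [12] and [13] are instances of one general computation, sketched below. The function name and the coefficient values are assumptions for the example; the sign convention $\psi_i = -\theta_i$ follows Eq. [6].

```python
def ma_autocorrelation(thetas, lag):
    """Autocorrelation of lag `lag` for the MA(q) process of Eq. [6].

    thetas = [theta_1, ..., theta_q]; the implied weights are
    psi_0 = 1 and psi_i = -theta_i.
    """
    psi = [1.0] + [-t for t in thetas]

    def gamma(j):
        # Autocovariance up to the common factor sigma_a**2, which cancels
        # in the ratio. The range is empty when j exceeds q, giving zero.
        return sum(psi[i] * psi[i + j] for i in range(len(psi) - j))

    return gamma(lag) / gamma(0)

rho1_ma1 = ma_autocorrelation([0.5], 1)         # -0.5 / (1 + 0.25)
rho2_ma1 = ma_autocorrelation([0.5], 2)         # zero beyond lag q = 1
rho1_ma2 = ma_autocorrelation([0.5, 0.3], 1)    # (-0.5 + 0.15) / 1.34
rho2_ma2 = ma_autocorrelation([0.5, 0.3], 2)    # -0.3 / 1.34
```

The abrupt cutoff of the autocorrelations after lag q is what makes them useful for recognizing a moving-average process in data.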


Autoregressive Processes.

Another important class of processes is the autoregressive process (AR). The equation for AR is obtained by going back to the general linear filter of Eq. [1] and rewriting the right-hand side in terms of the current disturbance and all past observations. To do so, we first transpose the terms in Eq. [1] to get



$$a_t = z_t - L - \psi_1 a_{t-1} - \psi_2 a_{t-2} - \cdots$$

and, noting that this holds for any time point, we have, e.g. for t-1,

$$a_{t-1} = z_{t-1} - L - \psi_1 a_{t-2} - \psi_2 a_{t-3} - \cdots$$

Substituting this in Eq. [1] in its original form, we get

$$z_t = L + a_t + \psi_1 (z_{t-1} - L - \psi_1 a_{t-2} - \psi_2 a_{t-3} - \cdots) + \psi_2 a_{t-2} + \cdots$$

from which $a_{t-1}$ has been eliminated. Similarly, we may successively eliminate $a_{t-2}$, $a_{t-3}$, etc., and ultimately get an equation of the form

[14]   $z_t = \beta + \pi_1 z_{t-1} + \pi_2 z_{t-2} + \cdots + a_t$,

where the coefficients $\pi_i$ are functions of the $\psi_i$'s and the constant $\beta$ is a function of $L$ and the $\psi_i$'s. The name "autoregressive model" comes from the fact that Eq. [14] resembles a multiple regression equation with $z_t$ as the criterion variable, the past observations $z_{t-1}, z_{t-2}, \ldots$ as predictors, and $a_t$ as the error of estimate.

Of course the series $\pi_1 z_{t-1} + \pi_2 z_{t-2} + \cdots$ must converge before [14] has any chance of representing a stationary process. But, as we shall see below, such convergence is only a necessary but insufficient condition for stationarity. As before, the simplest way to assure convergence of the series is to require that all the coefficients beyond the pth, say, shall vanish. When this is the case, we have an autoregressive process of order p, symbolized AR(p). Again for historical reasons, the coefficients $\pi_i$ are rewritten as $\phi_i$, and the conventional equation for AR(p) is

[15]   $z_t = \beta + \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + a_t$.

For the simplest case, AR(1), we have

[16]   $z_t = \beta + \phi_1 z_{t-1} + a_t$.

It may be tempting to take the expected values of both sides of this equation to get

$$E(z_t) = \beta + \phi_1 E(z_{t-1}) + 0,$$

and, letting $E(z_{t-1}) = E(z_t)$, obtain

$$E(z_t) = \beta / (1 - \phi_1).$$

However, this already assumes the process to be stationary [when we put $E(z_{t-1}) = E(z_t)$], whereas in fact it may not be. To see why Eq. [16] does not automatically represent a stationary process despite its having only two variable terms on the right, we must convert the equation back to MA form, i.e., a linear combination of present and past disturbances, for which we already know the condition for stationarity.

This is done by using [16] with t replaced by t-1 to express $z_{t-1}$ in terms of $z_{t-2}$ and $a_{t-1}$ as

$$z_{t-1} = \beta + \phi_1 z_{t-2} + a_{t-1},$$

whence

$$z_t = \beta + \phi_1 (\beta + \phi_1 z_{t-2} + a_{t-1}) + a_t = \beta (1 + \phi_1) + \phi_1^2 z_{t-2} + \phi_1 a_{t-1} + a_t;$$

then $z_{t-2}$ is expressed in terms of $z_{t-3}$ and $a_{t-2}$, and so forth. Continuing in this vein, we eventually get

$$z_t = \beta (1 + \phi_1 + \phi_1^2 + \cdots) + a_t + \phi_1 a_{t-1} + \phi_1^2 a_{t-2} + \cdots$$

Thus both the series in the $a_t$'s and the series forming the multiplier of $\beta$ converge if and only if $|\phi_1| < 1$. Once this condition is met, this equation is seen to be equivalent to

$$z_t = \frac{\beta}{1 - \phi_1} + \sum_{i=0}^{\infty} \phi_1^i a_{t-i},$$

which is precisely the process we referred to earlier as an example of a moving-average process of infinite order which nevertheless is stationary; $\beta / (1 - \phi_1)$ here plays the role of $L$. Thus, an AR(1) process is, under the condition stated, equivalent to an MA process of infinite order. We thus conclude that, if and only if $|\phi_1| < 1$, Eq. [16] represents a stationary process with

[18]   $E(z_t) = \beta / (1 - \phi_1)$,

and, from Eqs. [3*] and [4*],

$$\gamma_j = \sigma_a^2 \phi_1^j / (1 - \phi_1^2) \qquad (j = 0, 1, 2, \ldots).$$

Consequently, the autocorrelation of lag j is

[19]   $\rho_j = \gamma_j / \gamma_0 = \phi_1^j$.

Unlike for an MA process, the autocorrelation does not suddenly vanish after a certain lag, but steadily decreases exponentially.
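
The role of the condition on the AR(1) coefficient can be seen by iterating the expectation of Eq. [16] from an arbitrary starting value. The sketch below is only an illustration; the function name, starting value, and coefficients are assumptions chosen for the example.

```python
def ar1_mean_path(beta, phi1, start, steps):
    """Iterate m_t = beta + phi1 * m_{t-1}, the expectation of Eq. [16]."""
    means = [start]
    for _ in range(steps):
        means.append(beta + phi1 * means[-1])
    return means

# With |phi1| < 1 the expected value settles at beta / (1 - phi1) = 4.0,
# whatever the starting value.
stationary = ar1_mean_path(beta=2.0, phi1=0.5, start=10.0, steps=60)

# With |phi1| > 1 the expected value moves away from the fixed point without
# bound, so no constant mean exists.
explosive = ar1_mean_path(beta=2.0, phi1=1.1, start=10.0, steps=60)
```

This is why setting the expectations at t-1 and t equal is legitimate only after stationarity has been established.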

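The equation for AR(1) is often written in deviation-score form, thus: let

$$\tilde{z}_t = z_t - E(z_t) = z_t - \frac{\beta}{1 - \phi_1}.$$

Then, from Eq. [16],

$$\tilde{z}_t = \beta + \phi_1 z_{t-1} + a_t - \frac{\beta}{1 - \phi_1} = \phi_1 \left[ z_{t-1} - \frac{\beta}{1 - \phi_1} \right] + a_t = \phi_1 \tilde{z}_{t-1} + a_t.$$

Thus, the equation for AR(1) in deviation-score form,

[20]   $\tilde{z}_t = \phi_1 \tilde{z}_{t-1} + a_t$,

is the same as [16] except for the absence of the constant term $\beta$.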

AR Processes of Order Two and Higher. The model equation for AR(2) is

[21]   $z_t = \beta + \phi_1 z_{t-1} + \phi_2 z_{t-2} + a_t$.

Once it is ascertained that the stationarity condition (to be specified later) is satisfied, we may get $E(z_t)$ by taking the expected values of both sides of [21], letting $E(z_{t-1}) = E(z_{t-2}) = E(z_t)$, and solving to obtain

[22]   $E(z_t) = \beta / (1 - \phi_1 - \phi_2)$.

To compute the variance and autocovariances it is convenient to use the deviation-score form of Eq. [21]:

[23]   $\tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + a_t$,

where $\tilde{z}_t = z_t - E(z_t)$.



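Then

$$\gamma_0 = E(\tilde{z}_t^2) = E[\tilde{z}_t (\phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + a_t)] = \phi_1 \gamma_1 + \phi_2 \gamma_2 + \sigma_a^2.$$

The last term in the last step obtains because, from Eq. [23],

$$E(\tilde{z}_t a_t) = E[(\phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + a_t) a_t] = \phi_1 E(\tilde{z}_{t-1} a_t) + \phi_2 E(\tilde{z}_{t-2} a_t) + E(a_t^2) = \sigma_a^2,$$

and observations prior to time t are, of course, independent of the disturbance $a_t$ at time t. Similarly,

$$\gamma_1 = E(\tilde{z}_t \tilde{z}_{t-1}) = \phi_1 \gamma_0 + \phi_2 \gamma_1$$

and

$$\gamma_2 = E(\tilde{z}_t \tilde{z}_{t-2}) = \phi_1 \gamma_1 + \phi_2 \gamma_0.$$

We thus have the set of equations

[24]   $\gamma_0 = \phi_1 \gamma_1 + \phi_2 \gamma_2 + \sigma_a^2$

       $\gamma_1 = \phi_1 \gamma_0 + \phi_2 \gamma_1$

       $\gamma_2 = \phi_1 \gamma_1 + \phi_2 \gamma_0$

or, if we are interested only in the autocorrelations, we may divide both sides of the last two equations of this set by $\gamma_0$ to obtain

[25]   $\rho_1 = \phi_1 + \phi_2 \rho_1$

       $\rho_2 = \phi_1 \rho_1 + \phi_2$.

These are called the Yule-Walker equations.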

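Autocorrelations of lag greater than 2 may be computed from the recursion relation

[26]   $\rho_j = \phi_1 \rho_{j-1} + \phi_2 \rho_{j-2}$,

which results from

$$\gamma_j = E(\tilde{z}_t \tilde{z}_{t-j}) = E[(\phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + a_t) \tilde{z}_{t-j}].$$

Note that Eq. [26] is formally the same as [23] without the disturbance term $a_t$.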

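For higher-order autoregressive processes, say AR(p), the model equation, in deviation-score form, is

[27]   $\tilde{z}_t = \phi_1 \tilde{z}_{t-1} + \phi_2 \tilde{z}_{t-2} + \cdots + \phi_p \tilde{z}_{t-p} + a_t$,

where $\tilde{z}_t = z_t - \beta / (1 - \phi_1 - \phi_2 - \cdots - \phi_p)$.

The Yule-Walker equations for computing $\rho_1, \rho_2, \ldots, \rho_p$ are p in number, and may best be displayed in matrix notation. They are:

[28]   $$\begin{bmatrix} \rho_1 \\ \rho_2 \\ \rho_3 \\ \vdots \\ \rho_p \end{bmatrix} = \begin{bmatrix} 1 & \rho_1 & \rho_2 & \cdots & \rho_{p-1} \\ \rho_1 & 1 & \rho_1 & \cdots & \rho_{p-2} \\ \rho_2 & \rho_1 & 1 & \cdots & \rho_{p-3} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_{p-1} & \rho_{p-2} & \rho_{p-3} & \cdots & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \vdots \\ \phi_p \end{bmatrix}$$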



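The matrix on the right-hand side is symmetric with (i,j)- and (j,i)-elements equal to $\rho_{|i-j|}$. Thus, for instance, when p = 5, Eq. [28] reads

$$\begin{bmatrix} \rho_1 \\ \rho_2 \\ \rho_3 \\ \rho_4 \\ \rho_5 \end{bmatrix} = \begin{bmatrix} 1 & \rho_1 & \rho_2 & \rho_3 & \rho_4 \\ \rho_1 & 1 & \rho_1 & \rho_2 & \rho_3 \\ \rho_2 & \rho_1 & 1 & \rho_1 & \rho_2 \\ \rho_3 & \rho_2 & \rho_1 & 1 & \rho_1 \\ \rho_4 & \rho_3 & \rho_2 & \rho_1 & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \end{bmatrix}$$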





Autocorrelations of lag greater than p are given by the recursion
relation

[29]    $\rho_j = \phi_1 \rho_{j-1} + \phi_2 \rho_{j-2} + \cdots + \phi_p \rho_{j-p}$,

i.e., an equation identical in form to the model equation [27] itself,
except for the absence of $a_t$.

The expected value of $z_t$ following a stationary AR(p) process
is given by a simple extension of Eqs. [18] and [22], viz.:

[30]    $E(z_t) = \delta/(1 - \phi_1 - \phi_2 - \cdots - \phi_p)$,

as was already anticipated when the deviation-score model equation was
written.

Reciprocity between AR and MA Processes. What we saw in
connection with the AR(1) model above exemplifies an interesting
reciprocity that exists between autoregressive and moving-average
processes: a finite autoregressive process is equivalent to an
infinite moving-average process, while a finite moving-average process
is equivalent to an infinite autoregressive process. However, there
is a slight asymmetry in the reciprocal relation.

Even for the simplest finite autoregressive process AR(1)
to be stationary, it was seen that $\phi_1$ had to be less than one in
absolute value. On the other hand, MA(1) (or any finite moving-average
process, for that matter) is automatically stationary, as we saw earlier.
Nevertheless, there is a sense in which the coefficient $\theta_1$ in equation
[7] for MA(1) needs to satisfy $|\theta_1| < 1$ in order for the process to be
"reasonable." To show this, let us rewrite the equation for MA(1) in
autoregressive form.

From Eq. [7], with t replaced by t-1, we get

    $a_{t-1} = z_{t-1} - \mu + \theta_1 a_{t-2}$,

which, substituted back in [7], yields

    $z_t = \mu(1 + \theta_1) - \theta_1 z_{t-1} + a_t - \theta_1^2 a_{t-2}$.

Continuing in this manner, we eventually get

    $z_t = -\theta_1 z_{t-1} - \theta_1^2 z_{t-2} - \theta_1^3 z_{t-3} - \cdots + \mu(1 + \theta_1 + \theta_1^2 + \cdots) + a_t$.

Thus, even though MA(1) is known to be stationary, its rewriting in
autoregressive form does not make sense unless $|\theta_1| < 1$. The right-
hand side would "explode" if $|\theta_1| \geq 1$. Hence, we must require $|\theta_1| < 1$
for MA(1) even though no such condition was necessary for stationarity
of MA(1) in its own right. This is called the invertibility condition
for MA(1). Analogously, the requirement $|\phi_1| < 1$ is called the
invertibility condition for AR(1), even though in this case the condition
is necessary also for an AR(1) process to be stationary.

For AR and MA processes of higher order, the invertibility
conditions are more complicated, and we merely state them without
derivation.

(a) For an AR(p) process to be stationary, the roots of the
    characteristic equation

        $1 - \phi_1 x - \phi_2 x^2 - \cdots - \phi_p x^p = 0$

    must lie outside the unit circle. [This anticipates that
    at least some of the roots will generally be complex. For
    any real root $x_i$, the requirement is simply that $|x_i| > 1$.
    Note that, for p = 1, this reduces to the earlier
    condition, $|\phi_1| < 1$. For then the characteristic equation
    is $1 - \phi_1 x = 0$, whose root is $x_1 = 1/\phi_1$, so that $|x_1| > 1$
    is equivalent to $|\phi_1| < 1$.]

(b) For an MA(q) process to be meaningfully expressible in
    autoregressive form, the roots of the characteristic
    equation

        $1 - \theta_1 x - \theta_2 x^2 - \cdots - \theta_q x^q = 0$

    must lie outside the unit circle.
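
For orders 1 and 2 the unit-circle conditions in (a) and (b) can be
checked directly with the quadratic formula. The sketch below is
restricted to those two orders as a simplifying assumption, and the
function names are hypothetical. The same routine serves for the AR
coefficients or the MA coefficients, since both characteristic equations
have the same form.

```python
import cmath


def characteristic_roots(c1, c2=0.0):
    """Roots of 1 - c1*x - c2*x**2 = 0 (orders 1 and 2 only)."""
    if c2 == 0.0:
        if c1 == 0.0:
            return []                      # no coefficients, no roots
        return [complex(1.0 / c1)]         # order 1: x = 1 / c1
    # Rewrite as c2*x**2 + c1*x - 1 = 0 and apply the quadratic formula;
    # cmath.sqrt handles a negative discriminant (complex roots).
    disc = cmath.sqrt(c1 * c1 + 4.0 * c2)
    return [(-c1 + disc) / (2.0 * c2), (-c1 - disc) / (2.0 * c2)]


def roots_outside_unit_circle(c1, c2=0.0):
    """True when every root has modulus greater than one."""
    return all(abs(x) > 1.0 for x in characteristic_roots(c1, c2))
```

With AR coefficients as arguments, a result of `True` indicates
stationarity; with MA coefficients it indicates invertibility.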
THE MIXED MODEL: ARMA 



\ 



Given the two basic models, AR(p) and MA(q), for stationary
processes, it is a natural extension to form a combination of the two
resulting in the ARMA(p,q) model (an autoregressive moving-average
model of order p,q), with the equation

    $z_t = \phi_1 z_{t-1} + \phi_2 z_{t-2} + \cdots + \phi_p z_{t-p} + \delta + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$.

The advantage of making such a combination is implicit in the above
discussion of the reciprocity between AR and MA processes. A finite AR
process was shown to be equivalent to an infinite MA process, and vice
versa. What is more to the point here are the equivalences in the
opposite directions: an infinite MA process (or a finite one with a
very large order q) may be expressible as an AR process of very small
order, and, conversely, an AR process of very large order p may be
expressible as an MA process of low order. Combining the two would,
then, give us the best of two worlds, so to speak. Thus, a stationary
process which cannot be expressed either by a pure MA or a pure AR
model of reasonably low order may be expressible as a mixed ARMA(p,q)
model with quite small orders p and q. The savings in the number of
parameters to be estimated may be enormous.

The technicality of deriving the variance, autocovariances
and autocorrelations for the ARMA(p,q) model is tedious, although in
principle it involves no more than a combination of the procedures
described above for MA(q) and AR(p) models separately. Since our
main purpose here is simply to point out the advantage of sometimes
considering the combined ARMA model, we shall not go into these
derivations. We merely state the results for the simplest case, ARMA(1,1).

The model equation for ARMA(1,1) is

[31]    $z_t = \phi_1 z_{t-1} + \delta + a_t - \theta_1 a_{t-1}$.



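It can be shown that

    $E(z_t) = \delta/(1 - \phi_1)$    [the same as for AR(1)],

    $\gamma_0 = \dfrac{1 + \theta_1^2 - 2\phi_1\theta_1}{1 - \phi_1^2}\,\sigma_a^2$,

    $\gamma_1 = \dfrac{(1 - \phi_1\theta_1)(\phi_1 - \theta_1)}{1 - \phi_1^2}\,\sigma_a^2$,

and

    $\gamma_j = \phi_1 \gamma_{j-1}$,    $j \geq 2$.

Note that these results reduce to those for MA(1) when $\phi_1 = 0$, and to
those for AR(1) when $\theta_1 = 0$. The autocorrelations are immediately
obtainable by division: $\rho_j = \gamma_j/\gamma_0$, so we shall not list their formulas
here.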

MODELS FOR NONSTATIONARY PROCESSES 

Stationary time series are seldom "literally true" descriptions
of series encountered in practice, although they often provide good
approximations. But sometimes (perhaps often in behavioral-science
applications) they are not even adequate approximations, as when
learning or growth is involved.

Fortunately, however, many nonstationary time-series observations
that occur in real life exhibit what is known as homogeneous
nonstationarity, by which is meant that even though the series moves
about freely without centering around a fixed mean, its behavior is
essentially similar throughout the course of time. When this is true,
it often turns out that the series formed by the successive differences
between adjacent observations,

[32]    $w_t = z_t - z_{t-1}$,

is a stationary time series.

Sometimes, we may have to form second-order differences,

    $v_t = w_t - w_{t-1}$,

or even higher-order differences before stationarity is achieved. At
any rate, the stationary models previously described for MA, AR and
ARMA processes are usually found to be applicable to differences of
suitable order d of observations following a nonstationary process.
Thus, the most general model for nonstationary processes is one in
which the d-th order differences constitute an ARMA(p,q) process. This
is known as an integrated autoregressive moving-average process of
order p, d, q and is symbolized ARIMA(p,d,q).



The qualifier "integrated" simply means that the terms of the
original series $\{z_t\}$ are sums (of order d) of the d-th order differences
which follow ARMA(p,q). For example, when d = 1,

    $z_t = w_t + w_{t-1} + w_{t-2} + \cdots = \sum_{i=0}^{\infty} w_{t-i}$.

Similarly, when d = 2, since $w_t$ is itself the sum of present and all
past $v_t$'s, it follows that

    $z_t = \sum_{i=0}^{\infty} w_{t-i} = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} v_{t-i-j}$,

a double sum of the second-order differences.
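
The relation between differencing and summation ("integration") can be
illustrated on a finite series. In the sketch below the function names
are ours, and the infinite sums above are replaced, as a practical
assumption, by a cumulative sum that starts from a known first
observation.

```python
def difference(series, d=1):
    """Return the d-th order differences of a sequence.

    Each pass forms w_t = z_t - z_{t-1}, so the result is d items shorter.
    """
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - 1] for t in range(1, len(out))]
    return out


def integrate(diffs, start):
    """Undo one round of differencing by cumulative summation.

    `start` is the first value of the undifferenced series (an assumption
    standing in for the sum over the infinite past).
    """
    out = [start]
    for w in diffs:
        out.append(out[-1] + w)
    return out
```

For a series with a quadratic trend, such as `[0, 1, 4, 9, 16, 25]`, the
first differences still trend upward while the second differences are
constant, and `integrate(difference(z), z[0])` recovers the original
series.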



The equation for ARIMA(p,1,q), written in terms of $w_t$, is
simply the ARMA(p,q) equation for $w_t$; i.e.,

[33]    $w_t = \phi_1 w_{t-1} + \phi_2 w_{t-2} + \cdots + \phi_p w_{t-p} + a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \cdots - \theta_q a_{t-q}$.

Note, however, that there is one difference between this equation and
the equation, [31], for the ARMA(1,1) process in itself (which can
be readily generalized to ARMA(p,q)), in that [33] does not contain
the constant term $\delta$. Since the mean of ARMA(p,q) is the same as that
of AR(p), as shown for ARMA(1,1) after Eq. [31], it follows from Eq.
[30] that $E(w_t) = 0$. Thus the average of $z_t - z_{t-1}$ over a long period
of time is approximately zero. For the original time series $\{z_t\}$, this
implies that even though it does not center around a fixed mean, neither
does it show a perpetual trend upward or downward. Technically, this
is characterized by saying that $z_t$ shows a stochastic trend or drift,
but not a deterministic one. This is the situation usually treated
in time-series analysis. In educational research, where we usually
expect learning to be taking place, it may well be that a deterministic
trend should be incorporated. This can be done simply by adding a
non-zero constant $\delta$ to the right-hand side of Eq. [33], although Box
and Jenkins (1970, p. 93) advise against assuming a deterministic trend
unless the data give clear evidence of its presence and form (linear,
quadratic, etc.). Thus, the burden of proof seems to be on including
the constant term $\delta$ in Eq. [33].

From the foregoing discussions, it is clear that nothing
really new in the way of mathematical techniques is needed for handling
homogeneous nonstationary time series. We simply take differences of
sufficient order to achieve stationarity (as judged by methods discussed
below), and apply the methods developed for AR(p), MA(q) or ARMA(p,q)
processes, as the case may be. [Note that when all $\theta_i = 0$, [33] reduces to
the equation for an AR(p) process in $w_t$, while for all $\phi_i = 0$ it reduces to
that for an MA(q).]

There is, however, one new equation that it is sometimes
convenient to have in dealing with ARIMA(p,d,q) processes, or their special
cases, ARI(p,d) and IMA(d,q) processes. This is a rewriting of Eq. [33],
or one of its more special instances, in terms of $z_t$, in a form known
as the (cumulative) random-shock form. We illustrate this for the
simplest case, the IMA(1,1) process. The equation (replacing $w_t$ by
$z_t - z_{t-1}$) is

    $z_t - z_{t-1} = a_t - \theta_1 a_{t-1}$

or

[34]    $z_t = z_{t-1} + a_t - \theta_1 a_{t-1}$,

which, it may be noted incidentally, formally resembles an ARMA(1,1)
equation but nevertheless cannot be so construed, since the autoregressive
coefficient is $\phi_1 = 1$ (cf. Eq. [31]), thus violating the stationarity
condition $|\phi_1| < 1$.

Using [34] with t replaced by t - 1, we have

    $z_{t-1} = z_{t-2} + a_{t-1} - \theta_1 a_{t-2}$,

which may be substituted back in [34] to eliminate $z_{t-1}$:

    $z_t = z_{t-2} + a_t + (1 - \theta_1) a_{t-1} - \theta_1 a_{t-2}$.

Successively eliminating $z_{t-2}$, $z_{t-3}$, etc. in this manner, we eventually
get

    $z_t = a_t + (1 - \theta_1) \sum_{i=-\infty}^{t-1} a_i$,

where the sum of the $a_i$'s extends indefinitely into the past. It is
convenient to break this sum down into two parts,

    $\sum_{i=-\infty}^{k} a_i$    and    $\sum_{i=k+1}^{t-1} a_i$,

where k is an arbitrary reference point. We may then write

    $z_t = (1 - \theta_1) \sum_{i=-\infty}^{k} a_i + (1 - \theta_1) \sum_{i=k+1}^{t-1} a_i + a_t$

or, upon denoting the first partial sum by $L_k$,

[35]    $z_t = L_k + (1 - \theta_1) \sum_{i=k+1}^{t-1} a_i + a_t$.

From the strict mathematical standpoint, the first partial sum
$(1 - \theta_1) \sum_{i=-\infty}^{k} a_i$ above may not even converge, and hence we have no right
to denote this by $L_k$. However, from the practical standpoint we may
reasonably assume that the disturbances beyond some remote time in the
past should not affect the present observation, so that $a_i = 0$ for
those remote time points. It is important to remember, however, that
how far back is remote enough will depend on what the present time
point t is. Thus, $L_k$ is not strictly a constant, but depends in an
indirect way on t. (This is what keeps Eq. [35] from representing a
stationary process.) $L_k$ may be interpreted as the "level" of the system
at time point k.



IDENTIFYING THE PROCESS AND ESTIMATING ITS PARAMETERS

The foregoing concluded our discussion (necessarily incomplete
because our aim was to keep it as elementary as possible) of the various
models, stationary and nonstationary, for time-series observations. We
now come to the practical question: given a set of time-series data,
how do we identify which of the several models is appropriate, and
how do we estimate the parameters of the selected model?

It is at this point that we part company from the traditional
procedures of time-series analysis and propose alternative methods which
we believe to be better adapted to data from longitudinal studies. But
first we must outline the traditional methods and point out the
difficulties in applying them to longitudinal data.

Traditional Procedures

Since, as indicated earlier, nonstationary processes of the
homogeneous variety are adequately modeled by stationary processes, AR(p),
MA(q) and ARMA(p,q), in the differences of suitable order, we shall
confine our discussion here to stationary models.

The behavior of the sample counterparts of the autocorrelations
of various lags is, as mentioned earlier, the key to identifying
the appropriate model for a given set of time-series data. We therefore
first indicate how the sample autocorrelations $r_j$ corresponding to the
theoretical parameters $\rho_j$ have traditionally been defined and computed.

The Sample Autocorrelation $r_j$. Historically, there have been
several alternative definitions proposed for $r_j$, but the one currently
favored is as follows:

Given an observed series of data $z_1, z_2, \ldots, z_T$ at T time
points, we compute the sample variance $c_0$ and sample autocovariances
of lag j, $c_j$, as

[36]    $c_j = \dfrac{1}{T} \sum_{t=1}^{T-j} (z_t - \bar{z})(z_{t+j} - \bar{z})$,    $j = 0, 1, 2, \ldots$,

where $\bar{z}$ is the sample mean.

Based on the sample variance and autocovariances, the sample
autocorrelation of lag j is defined as

[37]    $r_j = c_j/c_0$,    $j = 1, 2, \ldots$
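
The computation of the sample autocorrelations can be sketched as
follows. This is an illustration only: the function name is hypothetical,
and the divisor T (rather than T - j) for every lag is the convention
assumed here.

```python
def sample_autocorrelations(z, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the series z.

    Uses c_j = (1/T) * sum_{t=1}^{T-j} (z_t - zbar)(z_{t+j} - zbar)
    and r_j = c_j / c_0.
    """
    T = len(z)
    mean = sum(z) / T
    dev = [x - mean for x in z]            # deviations from the sample mean

    def autocov(j):
        return sum(dev[t] * dev[t + j] for t in range(T - j)) / T

    c0 = autocov(0)
    if c0 == 0.0:
        raise ValueError("series is constant; autocorrelations undefined")
    return [autocov(j) / c0 for j in range(1, max_lag + 1)]
```

The returned list can be plotted against the lag j for the visual
inspection on which model identification rests.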



Once the sample autocorrelations have been computed, and
possibly plotted against j for visual inspection of their behavior, we
check to see if the trend with j corresponds approximately to the trend
exhibited by the theoretical autocovariances for any of the models
MA(q), AR(p) or ARMA(p,q).



Identification of an MA(q) Process. If an observed time
series conforms (approximately) to an MA(q) process, this fact is readily
discernible by inspection of the trend of $r_j$ with j. As stated in the
discussion preceding Eq. [13], the theoretical autocorrelations for an
MA(q) process are non-zero for lags up to and including q, and then
abruptly drop to zero. If the sample autocorrelations show this sort
of trend with j, we may safely conclude that a moving-average model
adequately fits the data, with order equal to the last j for which $r_j$
is substantially non-zero. For instance, if $r_1$ alone is of considerable
magnitude while $r_2, r_3, \ldots$ are essentially zero, we conclude that the
data are adequately modeled by an MA(1) process; if $r_1$ and $r_2$ are
substantially non-zero and the rest ($r_3, r_4, \ldots$) are of trivial magnitude,
we conclude that MA(2) offers an adequate fit.

There are significance tests available for judging when a
sample autocorrelation is "substantially non-zero" and when it is
"essentially zero" within sampling error, but we shall not discuss these
in this brief outline. The interested reader may refer to Box and
Jenkins (1970, pp. 177-78), Glass et al. (1975, pp. 97-98) or Nelson
(1973, pp. 71-72).

Identification of an AR(p) Process. Except when the observed
data sequence is adequately modeled by an AR(1) process, the
identification of the appropriate order p of an autoregressive process fitting
the data is much more difficult than in the moving-average case.

As shown in Eq. [19], the theoretical autocorrelations for
AR(1) exhibit an exponential decay with increasing lag j. If the
sample autocorrelations more or less follow this pattern (i.e.,
decreasing geometrically with lag j but not suddenly dropping to a near-zero
value from a certain j on), we are fairly safe in concluding that an
AR(1) model will fit the data adequately.

When the above happy circumstance does not prevail (and the
fitting of a moving-average model has already been ruled out), things
get much more complicated. Inspecting the behavior of autocorrelations
alone will not suffice, and we must examine what are known as partial
autocorrelations.



The basic rationale hinges on the relation between the
autoregressive coefficients $\phi_i$ and the autocorrelations $\rho_j$ specified by
the Yule-Walker equations (see Eqs. [25] and [28]). We know that for
AR(p) the coefficients $\phi_i$ for i > p must vanish; the Yule-Walker
equations enable us successively to estimate the $\phi_i$ by using the
sample autocorrelations $r_j$ in place of the theoretical $\rho_j$, and hence
to detect for what i $\phi_i$ first becomes essentially zero.

Assuming that AR(1) has been ruled out by the $r_j$'s not decaying
approximately exponentially, we wish to check if an autoregressive
process of order 2 or greater will fit the data. We replace the $\rho_1$ and
$\rho_2$ in the Yule-Walker equations [25] by their sample estimates $r_1$ and
$r_2$, thus:

    $r_1 = \hat\phi_1 + \hat\phi_2 r_1$
    $r_2 = \hat\phi_1 r_1 + \hat\phi_2$

where we have also replaced the $\phi_i$ by $\hat\phi_i$ to signify that we are solving
for estimates of $\phi_i$. If the solution for $\hat\phi_2$ differs significantly from
zero, we conclude that the order of the AR process is at least 2, and
proceed to the next step of checking if the order is 3 or greater.
That is, we solve the Yule-Walker equations with p = 3 for $\hat\phi_3$:

    $r_1 = \hat\phi_1 + \hat\phi_2 r_1 + \hat\phi_3 r_2$
    $r_2 = \hat\phi_1 r_1 + \hat\phi_2 + \hat\phi_3 r_1$
    $r_3 = \hat\phi_1 r_2 + \hat\phi_2 r_1 + \hat\phi_3$

If the solution for $\hat\phi_3$ is significantly different from zero, we proceed
to p = 4, and so on. Eventually, we will come to a p* for which $\hat\phi_{p^*}$ does
not differ significantly from zero, and we then conclude that AR(p*-1)
offers an adequate fit of the data (assuming, of course, that a pure
AR model is appropriate in the first place).

Although we have not used the term "partial autocorrelation"
in the above discussion, this is the name given to the $\hat\phi_j$ solved from
the Yule-Walker equations with p = j, and it is conventionally denoted
$\hat\phi_{jj}$, the sample partial autocorrelation of order j. This may be
computed by Cramer's rule [see Box and Jenkins (1970, p. 64)] without
having to solve the Yule-Walker equations in their entirety.

The significance test for $\hat\phi_{jj}$ is quite simple, for it has
been shown by Quenouille (1949) that the approximate standard error
of $\hat\phi_{jj}$ is $1/\sqrt{T}$ when $\phi_{jj} = 0$. Thus, we have merely to multiply the
computed value of $\hat\phi_{jj}$ by $\sqrt{T}$ (T being the number of data points) and refer
to a normal-curve table.
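
The successive partial autocorrelations and their test statistics can be
sketched in a few lines. Note that the sketch below does not use Cramer's
rule; it substitutes the Durbin-Levinson recursion, a standard method
that yields the same last coefficient of each successive Yule-Walker
system. The function names are hypothetical.

```python
import math


def partial_autocorrelations(r, max_order):
    """Sample partial autocorrelations phi_11 .. phi_KK.

    r[j-1] holds the lag-j autocorrelation; needs len(r) >= max_order.
    Durbin-Levinson recursion (swapped in here for Cramer's rule).
    """
    pacf = []
    prev = []                               # coefficients phi_{k-1,1..k-1}
    for k in range(1, max_order + 1):
        if k == 1:
            phi_kk = r[0]
        else:
            num = r[k - 1] - sum(prev[j] * r[k - 2 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * r[j] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append phi_kk.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
        prev.append(phi_kk)
        pacf.append(phi_kk)
    return pacf


def pacf_z_statistic(phi_jj, T):
    """sqrt(T) * phi_jj, to be referred to a normal-curve table."""
    return math.sqrt(T) * phi_jj
```

If the autocorrelations come from an AR(2) process, the third entry of
the returned list is zero apart from rounding, which is the cut-off
behavior the identification procedure looks for.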

* 

Identification of an ARMA(p,q) Process. If a pure MA model
has been ruled out, and the partial autocorrelation $\hat\phi_{jj}$ does not drop
to nonsignificance for a long time (i.e., until j exceeds 3 or 4, say),
then we must suspect that a mixed ARMA model may offer a better fit to
the data with lower orders p and q. (See discussion in section on
the mixed model.) Unfortunately, the identification of the orders of
an ARMA process is even more complicated a task than identifying the
order of a pure AR process.

About all we can say is that when both sample autocorrelations
and sample partial autocorrelations decline gradually rather
than dropping abruptly to near-zero, a mixed process is indicated. As
a working rule, it may be said that it is worth considering an ARMA
model only if both orders are no greater than 2; i.e., ARMA(1,1),
ARMA(1,2), ARMA(2,1) and ARMA(2,2) are the only models that should be
entertained seriously after pure MA and pure AR models of reasonably
low order have been ruled out. Beyond that, it is probably more
fruitful to postulate a nonstationary model.

Summary of Process-Identification Rules. We may summarize
the foregoing procedures for identifying an appropriate stationary
time-series process for modeling a sequence of observed data in the form
of a table listing the rules of thumb. It should always be borne in
mind that the orders should be relatively low (no higher than 3, perhaps,
for pure MA and AR models, and no higher than (2,2) for the mixed ARMA
model) for us to consider a stationary model seriously.

Table 1. Behaviors of autocorrelations and partial
autocorrelations in various processes

Process        Autocorrelations                   Partial Autocorrelations
-----------    -------------------------------    ----------------------------------
MA(q)          Non-zero for lags 1 through q;     (Taper off; but not necessary
               then abruptly drop to 0            to check)

AR(1)          Taper off exponentially            Only $\hat\phi_{11} \neq 0$

AR(p), p>1     Taper off according to the         $\hat\phi_{11}, \hat\phi_{22}, \ldots, \hat\phi_{pp} \neq 0$;
               recursion relation [29]            $\hat\phi_{ii} = 0$ for i > p

ARMA(p,q)      Irregular pattern for lags 1       Taper off
               through q; then taper off
               according to the recursion
               relation [29]



Recognizing Nonstationarity. If the sample autocorrelations
taper off very gradually over a long stretch of lags, we have prima facie
evidence that a nonstationary process is indicated. MA(q) is certainly
ruled out immediately, and even if an AR(p) process should be appropriate,
it is likely that the order p will be quite large. If the checking of
the first few partial autocorrelations (through $\hat\phi_{33}$, say) confirms this
by their being of considerable magnitude ($\sqrt{T}\,|\hat\phi_{33}| > 2$, say), AR(p) may
be ruled out for practical purposes. ARMA(p,q) should probably not be
considered unless the partial autocorrelations taper off rather rapidly.

Besides the above considerations, it is always a good idea to
make a plot of $z_t$ against t to get a visual impression of the lack of
stationarity, although one should not rely entirely on visual impressions.
At any rate, if the data are from an area in educational research such
that learning is expected to take place within the period of observation,
it is more likely than not that the series will display
nonstationarity, as mentioned several times earlier. Such being the case,
it is probably wise not to expend a large amount of time and effort in
seeking to make a Procrustean fit of the data series to some stationary
model. Rather, one should adopt the standpoint that nonstationarity
exists unless clear and quick (i.e., with low orders p, q, or (p, q)
for the model) evidence is available to the contrary.

Once we decide on a nonstationary model, we form the first
differences $w_t = z_t - z_{t-1}$ and treat the series $w_1, w_2, \ldots, w_{T-1}$ just
as we did the original observed series. That is, we determine the
autocorrelations, and (if necessary) the partial autocorrelations, of this
new series and check if an MA, AR or ARMA model of reasonably low order
will adequately fit the data. If so, we conclude that the original
series $z_1, z_2, \ldots, z_T$ is adequately modeled by an IMA(1,q), ARI(p,1)
or ARIMA(p,1,q) process. If not, it must be concluded that even the
series of the first-order differences exhibits nonstationarity. We then
take the second differences $v_t = w_t - w_{t-1}$ and repeat the entire search
procedure with the series $v_1, v_2, \ldots, v_{T-2}$. Fortunately, experience
shows that we rarely, if ever, need to go beyond the second-order
differences to achieve stationarity.



Estimation of Parameters. Once an appropriate model has been
identified, we may estimate the parameters of the model by using the
sample autocorrelations. In outline, what we do is to substitute the
sample autocorrelation values for the theoretical autocorrelations in
the equations relating the latter to the basic parameters, and solve
the resulting equation(s).

For pure AR processes, the procedure is straightforward. In
particular, for AR(1) we need only take the j = 1 instance of Eq. [19]
to get

    $\hat\phi_1 = r_1$.

[For somewhat greater accuracy of estimation, we might take the first
few instances,

    $\hat\phi_1 = r_1$,  $\hat\phi_1^2 = r_2$,  $\hat\phi_1^3 = r_3$,  $\ldots$,

and get a least-squares estimate for $\ln \phi_1$.]

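For AR processes of higher order, we may substitute the values
of $r_j$ for the corresponding $\rho_j$'s in the Yule-Walker equations (see Eq.
[28]) and solve the set of linear equations for the $\hat\phi_i$. Thus, for
example, for p = 3 (beyond which we would seldom wish to go), the
indicated substitutions in Eq. [28] yield

    $\begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix}
    =
    \begin{bmatrix}
    1   & r_1 & r_2 \\
    r_1 & 1   & r_1 \\
    r_2 & r_1 & 1
    \end{bmatrix}
    \begin{bmatrix} \hat\phi_1 \\ \hat\phi_2 \\ \hat\phi_3 \end{bmatrix}$

from which we immediately get

    $\begin{bmatrix} \hat\phi_1 \\ \hat\phi_2 \\ \hat\phi_3 \end{bmatrix}
    =
    \begin{bmatrix}
    1   & r_1 & r_2 \\
    r_1 & 1   & r_1 \\
    r_2 & r_1 & 1
    \end{bmatrix}^{-1}
    \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix}$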
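
The Yule-Walker estimates for a pure AR(p) process can be sketched as a
small linear solve. The sketch assumes the sample autocorrelations are
already in hand; both function names are hypothetical, and plain Gaussian
elimination is used here instead of an explicit matrix inverse.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for k in range(col, n + 1):
                m[i][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                      # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def yule_walker(r, p):
    """AR(p) coefficient estimates; r[j-1] is the lag-j autocorrelation."""
    rho = [1.0] + list(r[:p])                           # rho[0] = 1
    # Symmetric matrix whose (i, j) element is rho_{|i-j|}.
    matrix = [[rho[abs(i - j)] for j in range(p)] for i in range(p)]
    return solve_linear(matrix, rho[1:p + 1])
```

Fitting p = 3 to autocorrelations generated by an AR(2) process returns
the two original coefficients followed by a third that is zero apart
from rounding.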



For MA and ARMA processes, the procedures are somewhat more
complicated because the relations between the autocorrelations and the
basic parameters are nonlinear. Thus, for MA(1), we have, upon
substitution of $r_1$ for $\rho_1$ in Eq. [12],

    $r_1 = \dfrac{-\hat\theta_1}{1 + \hat\theta_1^2}$,

which is a quadratic in $\hat\theta_1$, with two solutions

[38]    $\hat\theta_1 = \dfrac{-1 \pm \sqrt{1 - 4r_1^2}}{2r_1}$.

It is easily verified that the two solutions are reciprocals of each
other. Hence, just one of them must satisfy the invertibility
condition, $|\hat\theta_1| < 1$. This is the one we take as our estimate for $\theta_1$.
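
The choice of the invertible root can be sketched as follows. The
function name is hypothetical, and the sketch assumes the sign convention
in which the lag-1 autocorrelation of MA(1) equals the negative of the
coefficient divided by one plus its square.

```python
import math


def ma1_theta_from_r1(r1):
    """Invertible MA(1) coefficient estimate from the lag-1 autocorrelation.

    Solves r1*theta**2 + theta + r1 = 0 and keeps the root with |theta| <= 1.
    """
    if r1 == 0.0:
        return 0.0
    if abs(r1) > 0.5:
        # The discriminant 1 - 4*r1**2 would be negative.
        raise ValueError("no real solution: |r1| cannot exceed 0.5 for MA(1)")
    root = math.sqrt(1.0 - 4.0 * r1 * r1)
    # The '+' root is the small one; the '-' root is its reciprocal.
    return (-1.0 + root) / (2.0 * r1)
```

A sample value of $r_1$ beyond 0.5 in absolute value has no real solution,
which the sketch reports as an error instead of returning an estimate.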

The equations become much more complicated for MA(2).
Substitution of $r_1$ and $r_2$ for $\rho_1$ and $\rho_2$, respectively, in Eqs. [13] yields

    $r_1 = \dfrac{-\hat\theta_1(1 - \hat\theta_2)}{1 + \hat\theta_1^2 + \hat\theta_2^2}$

and

    $r_2 = \dfrac{-\hat\theta_2}{1 + \hat\theta_1^2 + \hat\theta_2^2}$.

Simultaneous iterative solution of these two equations for $\hat\theta_1$ and $\hat\theta_2$
has been the usual approach. The present writer has found, however,
after some algebraic manipulations, that we may equivalently solve for
$\hat\theta_1$ from the equation

[39]    $\hat\theta_1 = -\dfrac{r_1}{r_2} \left[ \dfrac{2r_2}{1 - \sqrt{1 - 4r_2^2(1 + \hat\theta_1^2)}} + 1 \right]^{-1}$

and then obtain $\hat\theta_2$ from

[40]    $\hat\theta_2 = \dfrac{-1 + \sqrt{1 - 4r_2^2(1 + \hat\theta_1^2)}}{2r_2}$.

This simplifies the solution in that iteration (by, e.g., the Gauss-
Newton method) needs to be carried out only on Eq. [39], with one
unknown, $\hat\theta_1$. The closed expression [40] then yields $\hat\theta_2$. Also, the
satisfaction of the invertibility condition, that the roots of

    $1 - \hat\theta_1 x - \hat\theta_2 x^2 = 0$

must lie outside the unit circle, is built into Eqs. [39] and [40],
provided only that we take the solution of [39] with $|\hat\theta_1| < 1$.
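
A one-unknown iteration of this kind can be sketched as follows. Two
assumptions should be kept in mind: the sketch uses simple fixed-point
iteration in place of the Gauss-Newton method, so convergence is not
guaranteed for every pair of autocorrelations, and it takes the MA(2)
autocorrelations to be $r_1 = -\theta_1(1-\theta_2)/(1+\theta_1^2+\theta_2^2)$ and
$r_2 = -\theta_2/(1+\theta_1^2+\theta_2^2)$. The function name is hypothetical.

```python
import math


def ma2_thetas_from_r(r1, r2, tol=1e-12, max_iter=1000):
    """Estimate (theta1, theta2) of an MA(2) process from r1 and r2.

    Fixed-point iteration on the single unknown theta1 (a swapped-in,
    simpler substitute for Gauss-Newton); theta2 then follows in closed form.
    """
    if r2 == 0.0:
        raise ValueError("r2 = 0: consider an MA(1) model instead")

    def root_term(theta1):
        disc = 1.0 - 4.0 * r2 * r2 * (1.0 + theta1 * theta1)
        if disc < 0.0:
            raise ValueError("no real solution for these autocorrelations")
        return math.sqrt(disc)

    theta1 = 0.0
    for _ in range(max_iter):
        s = root_term(theta1)
        # theta1 = -(r1/r2) / (2*r2/(1 - s) + 1)
        new_theta1 = -(r1 / r2) / (2.0 * r2 / (1.0 - s) + 1.0)
        converged = abs(new_theta1 - theta1) < tol
        theta1 = new_theta1
        if converged:
            break
    else:
        raise RuntimeError("fixed-point iteration did not converge")

    theta2 = (-1.0 + root_term(theta1)) / (2.0 * r2)
    return theta1, theta2
```

Starting the iteration at zero keeps it on the branch with the smaller
coefficient in absolute value; a derivative-based method would be the
natural replacement where this simple scheme fails to converge.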

. ' • • ^ * 

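The procedures for MA processes of order q greater than 2 are,
needless to say, even more complicated. Simultaneous iterative solution,
by the Gauss-Newton method, of the system of nonlinear equations
(generalized from Eqs. [13])

    $r_1 = \dfrac{-\hat\theta_1 + \hat\theta_1\hat\theta_2 + \hat\theta_2\hat\theta_3 + \cdots + \hat\theta_{q-1}\hat\theta_q}{1 + \hat\theta_1^2 + \hat\theta_2^2 + \cdots + \hat\theta_q^2}$

    $r_2 = \dfrac{-\hat\theta_2 + \hat\theta_1\hat\theta_3 + \cdots + \hat\theta_{q-2}\hat\theta_q}{1 + \hat\theta_1^2 + \hat\theta_2^2 + \cdots + \hat\theta_q^2}$

    $\vdots$

    $r_q = \dfrac{-\hat\theta_q}{1 + \hat\theta_1^2 + \hat\theta_2^2 + \cdots + \hat\theta_q^2}$

is about all that can be hoped for.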



For mixed processes ARMA(p,q) the estimation procedure is
too complicated to expound here, except in gross outline, for all but
the simplest case, ARMA(1,1). In the general case, recursion relations
(similar to those for the pure AR process; cf. Eq. [29]) exist between
$\rho_{q+1}, \rho_{q+2}, \ldots, \rho_{q+p}$ and the autocorrelations of lag q or less. From
these relations, estimates $\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_p$ for $\phi_1, \phi_2, \ldots, \phi_p$ can be
computed. Then, utilizing the relations between $\rho_1, \rho_2, \ldots, \rho_q$,
the $\phi_i$'s and the $\theta_i$'s, we may solve for $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_q$ by substituting
$r_j$ for $\rho_j$ and $\hat\phi_i$ for $\phi_i$.

For the simplest case, ARMA(1,1), the details are as follows:
from the fourth equation after [31], letting j = 2, we get

    $\rho_2 = \phi_1 \rho_1$,

from which (after replacing $\rho_1$ and $\rho_2$ by $r_1$ and $r_2$, respectively),

    $\hat\phi_1 = r_2/r_1$.

Then, from two other equations following [31],

    $r_1 = \dfrac{(1 - \hat\phi_1\hat\theta_1)(\hat\phi_1 - \hat\theta_1)}{1 + \hat\theta_1^2 - 2\hat\phi_1\hat\theta_1}$,

in which $\hat\phi_1 = r_2/r_1$ may be substituted, and the resulting equation solved
for $\hat\theta_1$. (Alternative solutions for $\hat\theta_1$ will again be obtained, among
which the one satisfying the invertibility condition is chosen.)
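
The two steps for ARMA(1,1) can be sketched as below. The sketch assumes
the standard relation $\rho_1 = (1-\phi_1\theta_1)(\phi_1-\theta_1)/(1+\theta_1^2-2\phi_1\theta_1)$
for the lag-1 autocorrelation, which rearranges into a quadratic in the
moving-average coefficient whose two roots are reciprocals. The function
name is hypothetical.

```python
import math


def arma11_estimates(r1, r2):
    """Preliminary (phi1, theta1) estimates for ARMA(1,1) from r1 and r2."""
    if r1 == 0.0:
        raise ValueError("r1 = 0: phi1 = r2 / r1 is undefined")
    phi = r2 / r1
    # Rearranged relation: a*theta**2 + b*theta + a = 0.
    a = r1 - phi
    b = 1.0 + phi * phi - 2.0 * phi * r1
    if a == 0.0:
        return phi, 0.0                    # r1 equals phi: pure AR(1)
    disc = b * b - 4.0 * a * a
    if disc < 0.0:
        raise ValueError("no real solution for theta1")
    root = math.sqrt(disc)
    candidates = [(-b + root) / (2.0 * a), (-b - root) / (2.0 * a)]
    # The roots are reciprocals; keep the one of smaller modulus,
    # i.e. the one satisfying the invertibility condition.
    theta = min(candidates, key=abs)
    return phi, theta
```

As in the MA(1) case, only one of the two candidate roots can lie inside
the unit interval, so selecting the root of smaller absolute value
implements the invertibility requirement.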

* ' • * ' ' , 

Now, all the parameter estimates described above are, in
the traditional approach, taken to be "preliminary estimates" only.
After these are obtained, it is customary to use maximum-likelihood
methods [which in this case turn out to be equivalent to minimizing
the sum of squares for lack of fit, $\sum (z_t - \hat{z}_t)^2$], employing the preliminary
estimates as the starting values for the complicated iterative procedures
that have to be used. We shall not discuss this refinement here
because, as will be argued later, it seems to be unnecessary when we
use the alternative estimation procedure for longitudinal data to be
proposed below.



Difficulties with the Traditional Procedures When Applied to Longitudinal
Data

The reader will have noticed that, throughout the foregoing
discussions, it was assumed that there was but one observation at
each point in time. This is necessarily the case in economic or
demographic applications of time-series analysis, where, for example, the
consumer price index or the unemployment rate in successive years (or
quarters or months) constitutes the observations $z_t$.

For longitudinal data from an intact group being observed at
a series of time points 1, 2, ..., T, however, there are N observations
$z_{1t}, z_{2t}, \ldots, z_{Nt}$ at each time point t (= 1, 2, ..., T). That is to
say, instead of a vector of data

    $[z_1, z_2, \ldots, z_T]$

we have an N x T data matrix (where N is the group size)

    $Z = \begin{bmatrix}
    z_{11} & z_{12} & \cdots & z_{1T} \\
    z_{21} & z_{22} & \cdots & z_{2T} \\
    \vdots & \vdots &        & \vdots \\
    z_{N1} & z_{N2} & \cdots & z_{NT}
    \end{bmatrix}$

as our input series. True, the input data can always be condensed into
a row vector by considering only the group mean $\bar{z}_t$ at each time point
as our input (as we would be forced to do in order to apply the
traditional procedures as they stand), but this obviously does violence to
the data, and throws away a lot of potential information contained in the
separate rows of the data matrix Z; or, otherwise stated, ignores the
correlatedness of the T observations across each row. To draw an
analogy with analysis of variance, it is akin to using a randomized-
groups design when a repeated-measures design is the correct model to
use. This, then, is the major difficulty the present writer sees with
the traditional procedures of time-series analysis when they are to be
applied to data from longitudinal studies.

Another difficulty with the traditional procedures is that
they require a large number of time points T at which the (single)
observations are taken. Box and Jenkins (1970) assert that "to
obtain a useful estimate of the autocorrelation function, we would need
at least fifty observations [i.e., $T \geq 50$]..." (p. 33). It is
obviously too much to expect so many time points of observation in a
longitudinal study unless the unit of time is as short as a day, or
at most a month. But normally, in educational research, we would not
be interested in such short time units. A year, a semester, or at
least a quarter, would more likely be the interval between successive
observations. Thus, the number of time points will usually be in the
range 5-20 instead of the minimum of 50 recommended by Box and Jenkins.

It was precisely in an attempt to resolve the foregoing
difficulties that the present research was undertaken. It seemed intuitively
clear that having, say, N = 30 observations at each of T = 10 time points
should, in some sense, yield nearly as much information (although of
course not just as much) as having 300 time points, each with one
observation. To look only at the 10 mean observations $\bar{z}_1, \bar{z}_2, \ldots,
\bar{z}_{10}$ seems to be a gross waste of data.

It should be mentioned that Glass, Willson and Gottman (1975) have implicitly addressed themselves to this problem by discussing (in their first chapter) the distinction between unit-repetitive and unit-replicative designs. The former refers to the case when an intact group is "observed at several successive points in time," i.e., the genuine longitudinal study. The latter refers to the case when samples from the same conceptual population (e.g., the population of car drivers in a certain state in successive years), but one which does not comprise the same set of individuals over time, are observed at successive time points. Although they acknowledge the importance of both designs and even point out that use of the unit-replicative design may sometimes be invalid (as when a change of composition of the population occurs from before to after an intervention), they opt to deal, in their subsequent chapters, solely with procedures that are adapted to the unit-replicative design. Thus, the substantive examples they present concern such phenomena as the "percentage of students in Ireland who passed the intermediate and senior level examinations of the years 1879-1924," "the number of traffic fatalities per 100,000,000 driver miles in the state of New York for the 100 months from January 1951 to April 1960," and the "petition for reconciliation rate...in German states...prior to and [...] years after institution of the new Civil Code of the German Empire on January 1, 1900."

One cannot, of course, fault the authors for their particular choice of design (the unit-replicative design) on which to concentrate in their book. But the fact remains that, valuable as their pioneering efforts in bringing time-series analysis to the attention of educational researchers have been, they have not specifically considered the case of longitudinal studies, despite their frequent mention of this phrase.

Estimation Procedures Geared to Longitudinal Studies

After a number of trial-and-error attempts at developing model-identification and parameter-estimation procedures especially geared to the application of time-series analysis to longitudinal data (i.e., the unit-repetitive design, in Glass et al.'s terminology), the only viable procedure discovered to date was the "obvious" one of utilizing the ordinary sample correlation matrix based on the data matrix Z. This is "obvious" only in retrospect, however, since the use of the correlation matrix for estimating the autocorrelations carries with it the assumption that the observations $\{z_{it}\}$ for every individual follow the same stochastic process, with the same parameters, which is clearly a strong assumption. (More will be said about this later.)

Once it is decided to use the $T \times T$ sample correlation matrix R based on the data matrix Z for estimating the autocorrelations, the details of how to do so remain to be developed. The simplest way is to treat the average of the correlations $r_{ik}$, with subscripts such that $|i - k| = j$, as an estimate of $\rho_j$, the theoretical autocorrelation of lag $j$; that is, the mean of the correlations along the line parallel and adjacent to the main diagonal of R is used as an estimate of $\rho_1$; the mean of the correlations along the next line to the left and below this is [...]






outlined above for the traditional approach. Details are best relegated to a couple of numerical examples, one using real data and the other based on simulated data. The functions of these numerical examples are twofold: first, to provide some evidence of the validity of the proposed parameter-estimation (and hence also of the model-identification) procedure; and second, to illustrate the method for detecting and significance-testing an intervention effect as developed by Glass et al. (1975). The latter is not expounded here except in the context of the numerical examples for two reasons. First, the present writer is unable to improve upon (i.e., expound in a more elementary fashion than) the original exposition by Glass and his coworkers. Second, the writer believes that there must be a way more consonant with longitudinal data for detecting and testing intervention effects, but has so far been unable to discover one. Hence, the method developed by Glass et al. is here used as a "stop-gap" measure rather than something the writer would advocate in earnest for longitudinal studies. (This is not to detract from its merits as a method used in conjunction with unit-replicative as against unit-repetitive designs.)
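The diagonal-averaging estimate described earlier in this section (the mean of the correlations $r_{ik}$ with $|i - k| = j$ serving as the estimate of $\rho_j$) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the computations reported in this chapter: the function names and the toy matrix `Z_demo` are hypothetical, and the correlations on each diagonal are assumed to be weighted equally.

```python
def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def correlation_matrix(Z):
    """T x T correlation matrix of the columns of an N x T data matrix Z."""
    cols = list(zip(*Z))  # one tuple of N scores per time point
    T = len(cols)
    return [[pearson(cols[i], cols[k]) for k in range(T)] for i in range(T)]


def lag_autocorrelations(R):
    """Average the entries of R on each off-diagonal |i - k| = j (equal weights assumed)."""
    T = len(R)
    estimates = {}
    for j in range(1, T):
        diagonal = [R[i][i + j] for i in range(T - j)]
        estimates[j] = sum(diagonal) / len(diagonal)
    return estimates


# Hypothetical toy data: N = 5 individuals observed at T = 4 time points.
Z_demo = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [3.0, 4.0, 1.0, 2.0],
    [4.0, 3.0, 2.0, 1.0],
    [5.0, 5.0, 5.0, 6.0],
]
R_demo = correlation_matrix(Z_demo)
r_demo = lag_autocorrelations(R_demo)  # keys are the lags 1, 2, 3
```

Note that the estimate for the largest lag rests on a single correlation, so it deserves less trust than the estimates for short lags.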

NUMERICAL EXAMPLES*

*All computations were done by K. Tatsuoka on the PLATO system at the Computer-based Education Research Laboratory, University of Illinois at Urbana-Champaign.

Our first example is based on data from a study investigating




possible learning (or practice) effects in completing cloze passages.** Fifty-two fifth grade pupils were given three cloze passages (one on sports, one on music and one "miscellaneous," all passages being taken from a children's encyclopedia) to complete on each of 16 consecutive school days. The maximum possible score was 30 (10 for each passage). Complete data were available for 45 of the 52 subjects, so our input data matrix Z is of order 45 x 16. The column means, i.e., the group means for the 16 days, were as shown below; Figure 1 shows their plot. No discernible learning effect is present.

    $\bar{z}_1$ to $\bar{z}_8$:     13.56  12.47  13.11  11.60  17.07  14.13  12.00  15.53
    $\bar{z}_9$ to $\bar{z}_{16}$:  16.69  13.31  13.47  10.00  13.13  12.60  12.22  12.47

The correlation matrix based on the data matrix Z is shown in Table 2, along with the estimated correlations of lags 1 through 15, calculated in accordance with Eq. [41]. It is seen that the $r_j$'s decline irregularly and very gradually over the entire span of 15 lags, which is a sign that nonstationarity may be present. (This view is corroborated by the visual impression provided by Figure 1.) To make sure that an autoregressive process of order 2 or 3 will not offer an adequate fit, however, let us compute the partial autocorrelation coefficients $\hat\phi_{33}$ and $\hat\phi_{44}$. The Yule-Walker equations for $p = 3$, with



**The study was conducted by a graduate student, Gregory Bell, under the supervision of our colleague Steven Asher. We are greatly indebted to Steve for making the data available to us.



[Table 2: Correlation matrix (16 x 16) of the daily cloze scores, with the estimated autocorrelations of lags 1 through 15.]




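the $\rho_j$ replaced by $r_j$ are, from Eq. [28],

$$\begin{bmatrix} 1 & .621 & .576 \\ .621 & 1 & .621 \\ .576 & .621 & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} .621 \\ .576 \\ .604 \end{bmatrix}.$$

Since we are interested only in the value of $\phi_3$ (which is the same as $\phi_{33}$), we need not solve the entire system of equations for $\phi_1$, $\phi_2$ and $\phi_3$. Using Cramer's rule, we have

$$\hat\phi_{33} = \frac{\begin{vmatrix} 1 & .621 & .621 \\ .621 & 1 & .576 \\ .576 & .621 & .604 \end{vmatrix}}{\begin{vmatrix} 1 & .621 & .576 \\ .621 & 1 & .621 \\ .576 & .621 & 1 \end{vmatrix}} = \frac{.1012}{.3412} = .297.$$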
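The Cramer's-rule computation of a partial autocorrelation can be checked with a short Python sketch. The helper names `det` and `partial_autocorr` are illustrative assumptions, not part of the original computations; the determinant is expanded along the first row, which is adequate only for the small orders considered here.

```python
def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for c in range(n):
        minor = [row[:c] + row[c + 1:] for row in M[1:]]
        total += (-1) ** c * M[0][c] * det(minor)
    return total


def partial_autocorr(r):
    """phi_pp from r = [r_1, ..., r_p] via Cramer's rule on the Yule-Walker system."""
    p = len(r)
    rho = [1.0] + list(r)  # rho[0] = 1 is the lag-0 autocorrelation
    # Coefficient matrix: entry (i, k) is the autocorrelation of lag |i - k|.
    P = [[rho[abs(i - k)] for k in range(p)] for i in range(p)]
    # Numerator matrix: last column replaced by the right-hand side r_1, ..., r_p.
    P_star = [row[:-1] + [r[i]] for i, row in enumerate(P)]
    return det(P_star) / det(P)


# Estimated autocorrelations of lags 1-3 quoted in the text for the cloze data.
phi_33 = partial_autocorr([0.621, 0.576, 0.604])
print(round(phi_33, 3))
```

With the three values quoted in the text the ratio agrees with the reported .297 to three decimal places.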



Although this value is judged insignificant by the traditional significance test, for

$$\hat\phi_{33}\sqrt{T} = (.297)(4) = 1.19,$$

it should be borne in mind that the significance test is customarily used in conjunction with fairly large T (> 50, say). For T as small as 16, it would require a value of about .50 before it is judged significant. In situations like this, one should not rely heavily on significance tests. Regardless of its statistical insignificance, the value .297 is certainly a non-negligible one by any standard. If we were to adopt an AR model, we would certainly not be inclined to ignore the third term with coefficient .297. Thus, the order of the presumed AR process will be at least 3.

Similarly, by solving the Yule-Walker equations with $p = 4$ for $\phi_{44}$ we get $\hat\phi_{44} = .144$, which is still not close enough to zero to be negligible. Thus, if we were to try to fit an AR model to the original data, we would need the order to be at least 4.

At this point, both common sense and the principle of parsimony would suggest that, instead of continuing to try to find a stationary model to fit the original data, it would be more strategic to go to the first differences, $w_t = z_t - z_{t-1}$. The "data matrix" W is now of order 45 x 15.

Table 3 shows the 15 x 15 correlation matrix of the $w_t$'s, and the estimated autocorrelations of lags 1 through 14, again computed in accordance with Eq. [41]. It is seen that $r_j$ drops abruptly to a near-zero value for $j = 2$, although there are a few sporadic values that are not quite so small at larger lags. (The value -.215 for $r_{14}$ may be discounted, since it is based on just one correlation value, $r_{1,15}$.) Thus, it seems legitimate to entertain the MA(1) model for the sequence of first-order differences (which implies that the original series follows an IMA(1,1) process). I.e., we assume that

$$w_t = a_t - \theta_1 a_{t-1}.$$



The next step is to estimate $\theta_1$ by means of Eq. [12] with $\rho_1$ replaced by $r_1$. As we saw earlier, this equation has the solutions

$$\theta_1 = \frac{-1 \pm \sqrt{1 - 4\rho_1^2}}{2\rho_1}$$

[Table 3: Correlation matrix (15 x 15) of the first differences $w_t$, with the estimated autocorrelations of lags 1 through 14.]



given by Eq. [38]. Substituting $r_1 = -.440$ in this equation, we get $\hat\theta_1 = .5966$ or $1.6761$, of which the value less than unity, viz., $\hat\theta_1 = .5966$, is the one we need.
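The choice between the two roots can be illustrated with a brief Python sketch. It assumes the standard MA(1) relation $\rho_1 = -\theta_1/(1 + \theta_1^2)$, which rearranges to the quadratic $\rho_1\theta_1^2 + \theta_1 + \rho_1 = 0$; the function name `ma1_theta_roots` is hypothetical.

```python
import math


def ma1_theta_roots(r1):
    """Both roots of r1*theta**2 + theta + r1 = 0, smaller absolute value first."""
    if r1 == 0.0:
        raise ValueError("r1 = 0 gives theta = 0; the quadratic degenerates")
    disc = 1.0 - 4.0 * r1 * r1
    if disc < 0.0:
        raise ValueError("|r1| > 0.5 is not attainable by an MA(1) process")
    root = math.sqrt(disc)
    candidates = [(-1.0 + root) / (2.0 * r1), (-1.0 - root) / (2.0 * r1)]
    return sorted(candidates, key=abs)


# Lag-1 autocorrelation of the first differences quoted in the text.
invertible_root, other_root = ma1_theta_roots(-0.440)
print(round(invertible_root, 4), round(other_root, 4))
```

The two roots are reciprocals of each other, so exactly one of them lies inside the unit interval whenever $|r_1| < .5$; that is the root retained here.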

Having obtained this estimate, how can we tell whether it is a "good" one? Unlike in the case of a deterministic model (such as a regression equation), we cannot verify the goodness of fit by computing estimated scores from the model equation and comparing (or correlating) them with the observed scores, for the model equation contains the unobservable random variable $a_t$. There are some complicated and indirect methods for checking the adequacy of the chosen model and estimated parameter(s). (See, e.g., Nelson, 1973, Section 5-11.) In our numerical example it was decided, after various considerations, to use the following approach, which seemed simpler than existing techniques and adequate for our purpose. (It also has the advantage of illustrating, in its simplest form, the general method developed by Glass et al., 1975, for estimating and testing intervention effects.)

Suppose we imagine a fictitious intervention between days 8 and 9 such that it leads to an immediate elevation of the "level" of the system by a specified number of units, say 5 points. The modified plot of group means, with all points from day 9 on moved upwards by 5 units from their original positions in Figure 1, is shown in Figure 2. Of course, this constant elevation of scores will not affect the correlations among either the original $z_t$'s or the first differences $w_t$. Hence, the estimate of $\theta_1$ will remain unchanged. We may then ask the following question: Using the previously estimated $\hat\theta_1 = .5966$ in the technique for detecting an intervention effect, will we be able to "retrieve" the built-in change in level of +5 units? If so, we may be reasonably assured that both the model chosen and the estimated parameter value must have been adequate.

[Figure 2: ... school days, with the last eight means artificially boosted by 5 points.]

The appropriate instance of the intervention-effect estimation technique developed by Glass and his coworkers, following Kepka (1972), is as follows. Using the random-shock form of the IMA(1,1) model equation (i.e., Eq. [33]) with k arbitrarily taken to be 0, we write

$$z_t = L + (1 - \theta_1)\sum_{i=1}^{t-1} a_i + a_t \qquad (t = 1, 2, \ldots, 8)$$

as the structural equation for observations from day 1 through day 8. (Here "observation" refers to the group mean for each day.) Then, after an intervention between days 8 and 9 which is assumed to result in a change of level by $\delta$ units, the structural equation will change to

$$z_t = L + (1 - \theta_1)\sum_{i=1}^{t-1} a_i + a_t + \delta \qquad (t = 9, \ldots, 16)$$

from day 9 on.

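The next step is to recursively define a sequence of transformed variables $\{y_t\}$ as follows:

[42]
$$y_1 = z_1, \qquad y_t = (z_t - z_{t-1}) + \theta_1 y_{t-1} \quad (t \ge 2).$$

It can be shown that the $y_t$ thus defined are expressible as linear functions of $L$, $\delta$ and $a_t$. Namely,

$$\begin{aligned}
y_1 &= L + a_1 \\
y_2 &= \theta_1 L + a_2 \\
&\;\;\vdots \\
y_8 &= \theta_1^{7} L + a_8 \\
y_9 &= \theta_1^{8} L + \delta + a_9 \\
y_{10} &= \theta_1^{9} L + \theta_1\delta + a_{10} \\
&\;\;\vdots \\
y_{16} &= \theta_1^{15} L + \theta_1^{7}\delta + a_{16}
\end{aligned}$$

or, in matrix notation,

[43]
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_8 \\ y_9 \\ y_{10} \\ \vdots \\ y_{16} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \theta_1 & 0 \\ \vdots & \vdots \\ \theta_1^{7} & 0 \\ \theta_1^{8} & 1 \\ \theta_1^{9} & \theta_1 \\ \vdots & \vdots \\ \theta_1^{15} & \theta_1^{7} \end{bmatrix} \begin{bmatrix} L \\ \delta \end{bmatrix} + \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_8 \\ a_9 \\ a_{10} \\ \vdots \\ a_{16} \end{bmatrix}$$

which may symbolically be written

$$y = X\beta + a,$$

where $y$ and $a$ are obvious, $X$ is the 16 x 2 matrix of successive powers of $\theta_1$ and 0's, and $\beta = [L, \delta]'$.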

Once the equation is cast in this form the standard least-squares estimate $\hat\beta$ of $\beta$ for linear models may be computed as

[44]
$$\hat\beta = (X'X)^{-1}(X'y).$$



Here the vector $y$ is constructed, in accordance with Eqs. [42], from the "observed" sequence $\{z_t\}$ (which are the group means plotted in Figure 2), and the estimated $\hat\theta_1 = .5966$ replacing $\theta_1$. We illustrate the calculations in detail for the first few elements of $y$. The observed $z$ vector is:

$$z' = [13.56,\ 12.47,\ 13.11,\ 11.60,\ \ldots,\ 17.60,\ 17.22,\ 17.47].$$

Hence, the vector of first differences is

$$w' = [13.56,\ -1.09,\ .64,\ -1.51,\ \ldots,\ -.53,\ -.38,\ .25].$$

Then, in accordance with Eq. [42], we get

$$\begin{aligned}
y_1 &= z_1 = 13.56 \\
y_2 &= w_2 + \hat\theta_1 y_1 = -1.09 + (.5966)(13.56) = 6.9999 \\
y_3 &= .64 + (.5966)(6.9999) = 4.8161 \\
y_4 &= -1.51 + (.5966)(4.8161) = 1.3633,
\end{aligned}$$

and so on. The complete vector $y$ is, with elements rounded to two decimal places,

    [13.56, 7.00, 4.82, 1.36, 6.28, .82, -1.64, 2.53,
     7.69, 1.21, .88, -2.94, 1.37, .29, -.21, .12]'

With this and the 16 x 2 matrix $X$ with $\theta_1$ replaced by its estimate $\hat\theta_1 = .5966$, we may compute $\hat\beta$ in accordance with Eq. [44]. The result is

$$\hat\beta = \begin{bmatrix} \hat L \\ \hat\delta \end{bmatrix} = \begin{bmatrix} 13.2652 \\ 5.1276 \end{bmatrix}.$$



The estimated value, 5.1276, of $\delta$ is seen to be very close to the true value, 5.0, that we deliberately introduced into the system. Thus, we have some evidence to support the proposition that the model chosen and the estimated parameter value are adequate. This, in turn, suggests that the proposed method for estimating $\rho_j$ is a viable one.
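The steps of Eqs. [42]-[44] can be retraced with the following Python sketch. It is an illustration under stated assumptions: the function names are hypothetical, the 2 x 2 normal equations are solved directly rather than by a general matrix routine, and the input is the set of group means as printed (rounded to two decimals), so the estimates agree with the reported 13.2652 and 5.1276 only to about two decimal places.

```python
def transform(z, theta):
    """Eq. [42]: y_1 = z_1 and y_t = (z_t - z_{t-1}) + theta * y_{t-1}."""
    y = [z[0]]
    for t in range(1, len(z)):
        y.append((z[t] - z[t - 1]) + theta * y[-1])
    return y


def design_matrix(T, n_pre, theta):
    """Rows (theta**(t-1), theta**(t-n_pre-1)); second entry is 0 before the intervention."""
    rows = []
    for t in range(1, T + 1):
        level_col = theta ** (t - 1)
        shift_col = theta ** (t - n_pre - 1) if t > n_pre else 0.0
        rows.append((level_col, shift_col))
    return rows


def least_squares_2col(X, y):
    """Solve (X'X) beta = X'y for a two-column X via the explicit 2 x 2 inverse."""
    a = sum(c * c for c, _ in X)
    b = sum(c * d for c, d in X)
    d2 = sum(d * d for _, d in X)
    u = sum(c * yi for (c, _), yi in zip(X, y))
    v = sum(d * yi for (_, d), yi in zip(X, y))
    det = a * d2 - b * b
    return (d2 * u - b * v) / det, (a * v - b * u) / det


# Group means for the 16 days as printed in the text (rounded values).
means = [13.56, 12.47, 13.11, 11.60, 17.07, 14.13, 12.00, 15.53,
         16.69, 13.31, 13.47, 10.00, 13.13, 12.60, 12.22, 12.47]
boosted = means[:8] + [m + 5.0 for m in means[8:]]  # fictitious +5 shift from day 9 on

theta_hat = 0.5966
y = transform(boosted, theta_hat)
L_hat, delta_hat = least_squares_2col(design_matrix(16, 8, theta_hat), y)
print(round(L_hat, 2), round(delta_hat, 2))
```

Because each $y_t$ after the intervention carries the shift only through a decaying power of $\theta_1$, the estimate of $\delta$ is dominated by the first few post-intervention observations.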

However, the skeptic may feel, in view of the artificial manner in which an "intervention effect" was introduced, that we merely "got out what we put in," and the particular value of $\theta_1$ was immaterial. To check if this could have been the case, computations for $\hat\delta$ were repeated with the values of $\theta_1$ used in Eqs. [42]-[44] systematically varied from .10 through .90 in steps of .05. The results, abbreviated to show the values of $\hat\delta$ only for every other $\theta_1$ value used, were as follows:

    $\theta_1$:      .10    .20    .30    .40    .50    .60    .70    .80    .90
    $\hat\delta$:  6.168  6.089  5.934  5.706  5.425  5.116  4.812  4.550  4.372

These results effectively refute the hypothetical skeptic's contention. The value used for $\theta_1$ does make a difference in the value obtained for $\hat\delta$. And the value .5966 estimated by the proposed method comes close to being an optimal one. (By interpolation in the finer table, with $\theta_1$ varied in steps of .05, the "best" value of $\theta_1$ is found to be .6037, yielding $\hat\delta = 5.000$ to three decimal places.)
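The sensitivity check can be repeated with a compact Python sketch that accumulates the normal equations in a single pass. As before, the function name `delta_estimate` is hypothetical and the input is the rounded set of group means, so the values obtained differ from the tabled ones in the second decimal place.

```python
def delta_estimate(z, theta, n_pre):
    """Least-squares estimate of the level shift for a given theta (Eqs. [42]-[44])."""
    a = b = d2 = u = v = 0.0
    y_prev = 0.0
    for t, z_t in enumerate(z, start=1):
        # Eq. [42]: transformed observation.
        y_t = z_t if t == 1 else (z_t - z[t - 2]) + theta * y_prev
        c = theta ** (t - 1)                               # coefficient of L
        d = theta ** (t - n_pre - 1) if t > n_pre else 0.0  # coefficient of delta
        a += c * c
        b += c * d
        d2 += d * d
        u += c * y_t
        v += d * y_t
        y_prev = y_t
    return (a * v - b * u) / (a * d2 - b * b)


means = [13.56, 12.47, 13.11, 11.60, 17.07, 14.13, 12.00, 15.53,
         16.69, 13.31, 13.47, 10.00, 13.13, 12.60, 12.22, 12.47]
boosted = means[:8] + [m + 5.0 for m in means[8:]]

# theta varied from .10 through .90 in steps of .10, as in the abbreviated table.
sweep = [(k / 10.0, delta_estimate(boosted, k / 10.0, 8)) for k in range(1, 10)]
for theta, delta in sweep:
    print(f"{theta:.2f}  {delta:.3f}")
```

A finer grid (for example, steps of .01) can be obtained by changing the range in the list comprehension.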

At the same time, however, we note that the obtained value of $\hat\delta$ varies fairly slowly with $\theta_1$. In other words, the estimation of $\delta$ seems to be fairly robust with respect to minor inaccuracies in the estimation of $\theta_1$. This is the ground on which we earlier asserted that further refinement of parameter estimates by maximum-likelihood methods seemed unnecessary, at least when the main purpose is to estimate the intervention effect. Of course, one instance does not prove a general proposition, and this assertion must remain a working hypothesis unless and until it is confirmed by further research.

Second Example: Simulated Data

In order to check the performance of the proposed method for a model of order higher than 1, simulated data following an AR(2) process were generated as follows. Taking $\phi_1 = .6$ and $\phi_2 = .3$ in Eq. [21], the particular AR(2) model used was

$$z_t = 3 + .6 z_{t-1} + .3 z_{t-2} + a_t,$$

with $a_t$ generated by a random unit-normal generator and rescaled so that $\sigma_a = 4$. One hundred independent sequences

$$\{z_{i1}, z_{i2}, \ldots, z_{i,16}\}$$

were generated by use of the above equation, except for $t = 1$ and 2, for which

$$z_1 = 3 + a_1$$

and

$$z_2 = 3 + .6 z_1 + a_2$$

were used since there are no observations prior to $z_1$.
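A generation scheme of this kind can be sketched as follows. This is a hedged reconstruction, not the original PLATO program: the function name `simulate_ar2`, the use of Python's `random.Random.gauss`, and the seed are all assumptions, and the start-up rule simply follows the two special equations for $t = 1$ and 2 given above. Since the original random numbers and any further details of the generation are unknown, the column means produced by this sketch should not be expected to reproduce those reported in the text.

```python
import random


def simulate_ar2(n_series=100, T=16, const=3.0, phi1=0.6, phi2=0.3,
                 sigma_a=4.0, seed=0):
    """Generate n_series independent AR(2) sequences of length T (T >= 2)."""
    rng = random.Random(seed)
    Z = []
    for _ in range(n_series):
        # Unit-normal shocks rescaled to standard deviation sigma_a.
        a = [sigma_a * rng.gauss(0.0, 1.0) for _ in range(T)]
        z = [const + a[0]]                       # t = 1: no earlier observations
        z.append(const + phi1 * z[0] + a[1])     # t = 2: only one earlier observation
        for t in range(2, T):
            z.append(const + phi1 * z[t - 1] + phi2 * z[t - 2] + a[t])
        Z.append(z)
    return Z


Z_sim = simulate_ar2()
col_means = [sum(col) / len(col) for col in zip(*Z_sim)]
```

The resulting 100 x 16 matrix can be passed to any routine that computes the sample correlation matrix and averages its off-diagonals.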



The result was a 100 x 16 data matrix Z, whose column means were as follows:

    $\bar{z}_1$ to $\bar{z}_8$:     18.71  22.41  22.14  22.96  23.94  24.79  24.90  24.94
    $\bar{z}_9$ to $\bar{z}_{16}$:  ...    ...    ...    ...    ...    ...    28.28  28.51

That these show a monotone increase with $t$ reflects the fact that our choice of $\sigma_a$ (= 4) was, in retrospect, too small relative to $\phi_1 = .6$, $\phi_2 = .3$ to produce an oscillating series in the short run of 16 time points. This does not, however, vitiate the results of further analysis.

The correlation matrix based on this simulated data matrix is shown in Table 4, along with estimated autocorrelations of lags 1-15, calculated in accordance with Eq. [41].

Now let us pretend we did not know that these estimated autocorrelations were based on simulated data following a particular process, and go through the motions of identifying an appropriate model and estimating the parameter(s). First of all, we observe that there is no abrupt drop of the sample autocorrelations to near-zero; so an MA process is ruled out. Next, we note that there is a steady and fairly rapid declining of $r_j$ with $j$, unlike the very gradual and irregular declining found in Table 2. So a stationary AR process of some order



[Table 4: Correlation matrix (16 x 16) based on the simulated data, with estimated autocorrelations of lags 1 through 15.]



is suggested (cf. Table 1 for the behavior of autocorrelations for various processes). The question is, what order?

The rate of decline does not seem quite as rapid as to suggest AR(1), which shows an exponential decay of the $\rho_j$. However, taking the successive ratios $r_j / r_{j-1}$ (which should all estimate $\phi_1$ if an AR(1) model is adopted), it seems barely possible that an AR(1) model with $\phi_1 \approx .90$ might fit the data. (We say "barely possible" because the value .90 for $\phi_1$ estimated from the successive ratios is considerably larger than $r_1 = .816$, which should also be an estimate of $\phi_1$ if AR(1) is in fact the correct model.) We therefore need to look at the estimated partial autocorrelations to decide the issue.

Setting $p = 2$, the Yule-Walker equations (cf. Eq. [25]) with $\rho_1$ and $\rho_2$ replaced by $r_1$ and $r_2$, respectively, are

$$\begin{aligned} \phi_1 + .816\,\phi_2 &= .816 \\ .816\,\phi_1 + \phi_2 &= .760 \end{aligned}$$

whose solutions are

$$\hat\phi_1 = .586 \quad \text{and} \quad \hat\phi_2 = .282.$$

Clearly, $\hat\phi_2$ is not small enough to conclude that $\phi_2 = 0$. That is, an AR(1) model is ruled out as inappropriate.
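The two-equation Yule-Walker system above has a simple closed-form solution, which the following Python sketch evaluates. The function name `yule_walker_ar2` is hypothetical; the formulas are just the result of solving the two linear equations for $\phi_1$ and $\phi_2$.

```python
def yule_walker_ar2(r1, r2):
    """Solve phi1 + r1*phi2 = r1 and r1*phi1 + phi2 = r2 for (phi1, phi2)."""
    denom = 1.0 - r1 * r1
    if denom == 0.0:
        raise ValueError("|r1| = 1 makes the system singular")
    phi1 = r1 * (1.0 - r2) / denom
    phi2 = (r2 - r1 * r1) / denom
    return phi1, phi2


# Estimated autocorrelations of lags 1 and 2 quoted in the text.
phi1_hat, phi2_hat = yule_walker_ar2(0.816, 0.760)
print(round(phi1_hat, 3), round(phi2_hat, 3))
```

The second value returned is also the lag-2 partial autocorrelation, which is why its size bears directly on the choice between AR(1) and AR(2).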

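Next, let us compute $\hat\phi_{33}$ ($= \hat\phi_3$) from the Yule-Walker equations with $p = 3$; i.e.,

$$\begin{bmatrix} 1 & .816 & .760 \\ .816 & 1 & .816 \\ .760 & .816 & 1 \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \end{bmatrix} = \begin{bmatrix} .816 \\ .760 \\ .685 \end{bmatrix}.$$

Using Cramer's rule, we get

$$\hat\phi_{33} = \frac{\begin{vmatrix} 1 & .816 & .816 \\ .816 & 1 & .760 \\ .760 & .816 & .685 \end{vmatrix}}{\begin{vmatrix} 1 & .816 & .760 \\ .816 & 1 & .816 \\ .760 & .816 & 1 \end{vmatrix}} = \frac{.0033}{.1028} = .032,$$

which is negligibly different from 0. We may therefore conclude that AR(2) offers an adequate fit to the data.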



Once we adopt AR(2), our estimates of $\phi_1$ and $\phi_2$ are those previously computed from the Yule-Walker equations with $p = 2$; namely,

$$\hat\phi_1 = .586 \quad \text{and} \quad \hat\phi_2 = .282.$$

Abandoning our make-believe that we do not know the "genealogy" of our data, the estimated values for $\phi_1$ and $\phi_2$ are quite close to the actual values, .60 and .30, that were used to generate the simulated data. We may therefore conclude that the proposed method for parameter estimation "works" for second-order processes as well as the first.



SUMMARY AND REMARKS 



The bulk of this chapter is admittedly expository in nature, but it is believed that the exposition was made in a more elementary manner than found in currently available books on the subject, although, by the same token, the treatment was necessarily incomplete in some technical detail.

The one original contribution made in this chapter is the proposal of an alternative method for estimating autocorrelations of various lags, the key to model identification and parameter estimation in time-series analysis. This method is based on the ordinary sample correlation matrix, which is computable whenever genuine longitudinal data are to be analyzed (i.e., when a single intact group has been observed at several time points). The traditional method for estimating autocorrelations (based on a single observation at each point in time, such as group means on the several measurement occasions), it was argued, is not appropriate for two reasons. First, it ignores the correlatedness inherent in longitudinal data, just as though we were to use a randomized-groups design ANOVA when a repeated-measures design is proper. Second, the traditional method requires such a long series of observations in time (at least 50 observations, according to Box and Jenkins, 1970) as is almost never available in longitudinal studies.

The proposed method was put to a test by means of two numerical examples, one based on real data and the other on simulated data. The outcomes of these analyses adequately confirmed the "validity" of the proposed method.

Directions for Future Research

Obviously, further study of the efficacy of the proposed method is needed; what was accomplished within the contract period has only scratched the surface in this respect. One thing which urgently needs to be done is to relax the assumption, inherent in the method as it stands, that the parameters are identical for all individuals in a group. This is clearly an unrealistic assumption, although, in one sense, an innocuous one. When this assumption is untenable, what we get as parameter estimates are some sort of averages of the respective individual parameters. However, it would be much more satisfactory if individual differences in the parameters could be explicitly considered. For instance, by assuming some particular distribution of each parameter over a population of individuals, the autocorrelations could probably be related to the moments of this distribution.

Another matter which requires further research is the method of estimating and testing intervention effects. The techniques developed by Glass, Willson and Gottman (1975) are perfectly satisfactory in situations where there is but one observation per time point. But, somehow, one feels that they are wasteful of information when applied to data from genuine longitudinal studies.

It is regrettable that the present researcher could make no inroads into the above-mentioned problems within the contract period, mainly because he was a relative novice in the discipline of time-series analysis at the outset of the period, a novice who was dissatisfied with certain aspects of the traditional methods of time-series analysis when they are sought to be applied to longitudinal data. However, he intends to follow up this line of research in the future.






REFERENCES

Anderson, T. W. Estimation of covariance matrices which are linear combinations, or whose inverses are linear combinations, of given matrices. In Bose, R. C. et al. (eds.) Essays in probability and statistics. Chapel Hill: University of North Carolina Press, 1970.

Box, G. E. P. and Jenkins, G. M. Time-series analysis: Forecasting and control. San Francisco: Holden-Day, 1970.

Box, G. E. P. and Tiao, G. C. A change in level of a non-stationary time series. Biometrika, 1965, 52, 181-192.

Campbell, D. T. Reforms as experiments. American Psychologist, 1969, 24, 409-429.

Glass, G. V, Willson, V. L. and Gottman, J. M. Design and analysis of time-series experiments. Boulder, Colo.: Colorado Associated University Press, 1975.

Kepka, E. J. Model representation and the threat of instability in the interrupted time series quasi-experiment. Unpublished Ph.D. dissertation, Northwestern University, June 1972.

Nelson, C. R. Applied time series analysis for managerial forecasting. San Francisco: Holden-Day, 1973.

Quenouille, M. H. Approximate tests of correlation in time series. Journal of the Royal Statistical Society (Series B), 1949, 11, 68-84.




CHAPTER 9

ESTIMATION OF TRUE CHANGE: UPPER AND LOWER BOUNDS

INTRODUCTION




In Chapter [...] of this Report, Linn and Slinde have presented a survey of the literature on the topic of measurement of change and its many problems, seemingly insurmountable problems that led Cronbach and Furby (1970) to recommend against the use of gain scores, and advise instead that researchers "frame their questions in other ways."

Without discounting the seriousness of the problems surrounding the measurement of change, the present writers wish to propose that at least some of these problems can be traced to an unjustifiable assumption in classical test theory: that the error components of any pair of test scores are uncorrelated. In this chapter we explore new vistas that may be opened if the assumption of "universally uncorrelated measurement errors" is dropped. The dropping of this assumption, however, leads to mathematical problems that are insurmountable unless techniques hitherto not utilized in test theory, in particular operator analysis, are introduced. This approach, pioneered in the first author's recent doctoral dissertation (K. Tatsuoka, 1975), is used in this chapter.

NOTATION AND DEFINITIONS

By and large, the notation used in this chapter follows that of Lord and Novick (1968), but there are some peculiarities. So we set forth a complete notational guide in this section, even though many of the symbols are in universal use and need no explanation.



9-2 



All lower-case Roman letters (except those used as subscripts
and superscripts) stand for person-space vectors in deviation form,
rescaled by the factor $1/\sqrt{N-1}$, where N is the sample size. Thus, e.g.,

    $x = \left( \frac{X_1 - \bar{X}}{\sqrt{N-1}}, \frac{X_2 - \bar{X}}{\sqrt{N-1}}, \ldots, \frac{X_N - \bar{X}}{\sqrt{N-1}} \right)$

is the N-vector whose elements are the deviation scores on test X for
a sample of N persons, each divided by $\sqrt{N-1}$.

All Greek letters stand for scalars, while upper-case Roman
letters either stand for scalars (like N) or are generic symbols for
tests (like X and Y) or other random variables.

An immediate consequence of the above definition of the test
vector x is that its squared norm (i.e., the scalar product of x with
itself) represents the variance of test X:

    $(x,x) = \|x\|^2 = \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sqrt{N-1}} \right)^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N-1} = \sigma_x^2 .$

Similarly, the scalar product between two different test vectors x and
y represents the covariance between tests X and Y:

    $(x,y) = \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sqrt{N-1}} \right) \left( \frac{Y_i - \bar{Y}}{\sqrt{N-1}} \right) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N-1} = \sigma(x,y) .$



Note that (x,y) is used instead of the more customary x'y for a scalar
product. This is because we will never have occasion to use the
matrix product xy' of two vectors, and scalar products will mostly occur
as coefficients in a linear combination of vectors, so it is convenient
to set them apart with parentheses.

In this notation the simple regression coefficient $b_{yx}$ of
Y on X, whose usual formula is

    $b_{yx} = \frac{\sum xy}{\sum x^2} = \frac{\mathrm{Cov}(X,Y)}{\mathrm{Var}(X)}$

becomes

    $b_{yx} = \frac{\sigma(x,y)}{\|x\|^2}$, or simply $\frac{(x,y)}{\|x\|^2}$,

which further reduces to

    $b_{yx} = \sigma(x,y)$, or simply $(x,y)$,

when x is of unit norm (i.e., $\|x\| = 1$). This form will repeatedly
occur in the sequel. Also, the correlation coefficient

    $r_{xy} = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{\mathrm{Var}(Y)}}$

becomes

    $r_{xy} = \frac{\sigma(x,y)}{\|x\|\,\|y\|} = \frac{(x,y)}{\|x\|\,\|y\|} .$

Hence, orthogonality of two vectors x and y [i.e., (x,y) = 0] is
synonymous with the uncorrelatedness of the two tests X and Y which
they represent ($r_{xy} = 0$). We shall often use the terms "orthogonal"
and "uncorrelated" interchangeably, even though, strictly speaking,
the former is a geometric property of two vectors while the latter
is a statistical property of the two tests represented by the vectors.

The component of a vector y in the direction of another
vector x is given by

    $(y,x)/\|x\|$, or simply $(y,x)$ if $\|x\| = 1$.

[This follows from the cosine law,

    $(x,y) = \|x\|\,\|y\| \cos\theta$

(where $\theta$ is the angle between the vectors x and y) and the fact,
verifiable by elementary geometry, that the component in question is
$\|y\| \cos\theta$.]

The projection (more precisely, orthogonal projection) of a vector y onto
vector x is a vector whose norm (length) is equal to the component of
y in the direction x, and whose direction is that of x. In other words,
it is the component (as defined above) multiplied by the unit vector
in the direction of x; i.e.,

    $\mathrm{Proj}(y|x) = \left( \frac{(y,x)}{\|x\|} \right) \frac{x}{\|x\|} = \frac{(y,x)}{\|x\|^2}\, x .$

Note that the coefficient of x here is precisely the regression coefficient
$b_{yx}$ of y on x, defined earlier. Thus, the projection of y on x
is the same thing as the regression of test Y on test X, and may be
denoted

    $\hat{y} = R(y|x) = \frac{(y,x)}{\|x\|^2}\, x .$

This interpretation of regression as the outcome of applying the "projection
operator" to a vector is what enables us to utilize the various
theorems and techniques of operator analysis alluded to in the Introduction.
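The identity between projection and simple regression described above can be illustrated with a short Python sketch. The scores are made up, and the function names `deviations` and `project` are illustrative assumptions rather than anything taken from the report. The common factor 1/sqrt(N - 1) is omitted because it cancels in the coefficient.

```python
def dot(u, v):
    """Scalar product (u, v)."""
    return sum(a * b for a, b in zip(u, v))

def deviations(scores):
    """Scores expressed as deviations from their mean."""
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

def project(y, x):
    """Return (b, y_hat): the coefficient (y,x)/||x||^2 and the projection b*x."""
    b = dot(y, x) / dot(x, x)
    return b, [b * xi for xi in x]

# Hypothetical data for five persons.
x = deviations([1.0, 2.0, 3.0, 4.0, 5.0])
y = deviations([2.1, 3.9, 6.2, 7.8, 10.1])

b, y_hat = project(y, x)
residual = [yi - hi for yi, hi in zip(y, y_hat)]

print("regression coefficient of y on x:", b)
print("(residual, x):", dot(residual, x))
```

The residual is orthogonal to x, which is the geometric counterpart of the statement that regression residuals are uncorrelated with the predictor.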

The multiple regression of test Y on tests $X_1, X_2, \ldots, X_p$ is
denoted by

    $\hat{y} = R(y|x_1, x_2, \ldots, x_p) .$

Geometrically $\hat{y}$ corresponds to the projection of y onto the space
spanned by $x_1, x_2, \ldots, x_p$.

Finally, two symbols which probably need no explanation are:

    $\rho_{xx}$ = reliability of test X,

    $\rho(x,y)$ = correlation between X and Y.



ESTIMATING TRUE CHANGE FROM PRE- AND POST-TEST SCORES 



The multiple regression equation for estimating $T_2 - T_1$ from
the observed pre- and post-test scores, $X_1$ and $X_2$, may be written as

    [1]    $\widehat{t_2 - t_1} = R(t_2 - t_1 | x_1, x_2) .$

However, it is more convenient to use as predictors a pair of uncorrelated
variables (such as the principal components, for example) instead of the
original $X_1$ and $X_2$ themselves. A further convenience is to have the
derived predictor variables standardized so their vectors will be of
unit norm. It is well known that multiple regression is invariant under
any nonsingular linear transformation of the predictor variables; i.e.,
if the derived predictors are linear combinations of the original predictors
such that the coefficient determinant is non-zero, then using
the multiple regression equation with the transformed predictors will
yield predictions identical to those using the original multiple regression
equation. For example, if the original predictors are $X_1$ and
$X_2$, a new pair of predictors $Y_1$ and $Y_2$ defined by

    $Y_1 = \gamma_{11} X_1 + \gamma_{12} X_2$

    $Y_2 = \gamma_{21} X_1 + \gamma_{22} X_2$

will leave the predictions unchanged so long as

    $\begin{vmatrix} \gamma_{11} & \gamma_{12} \\ \gamma_{21} & \gamma_{22} \end{vmatrix} \neq 0 .$
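The invariance property just stated can be checked with a small numerical sketch. The data and the transformation coefficients below are hypothetical, and the helper `predict` (a two-predictor least-squares fit obtained from the 2x2 normal equations) is an illustrative assumption, not a routine from the report.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(scores):
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]

def predict(crit, p1, p2):
    """Least-squares predictions of crit from two centered predictors."""
    a11, a12, a22 = dot(p1, p1), dot(p1, p2), dot(p2, p2)
    b1, b2 = dot(crit, p1), dot(crit, p2)
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]

# Hypothetical centered scores for six persons.
x1 = center([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = center([2.0, 1.0, 4.0, 3.0, 6.0, 8.0])
crit = center([1.0, 3.0, 2.0, 5.0, 4.0, 7.0])

# Hypothetical transformation with a non-zero coefficient determinant.
gamma = [[2.0, 1.0], [1.0, -1.0]]
det_gamma = gamma[0][0] * gamma[1][1] - gamma[0][1] * gamma[1][0]
y1 = [gamma[0][0] * a + gamma[0][1] * b for a, b in zip(x1, x2)]
y2 = [gamma[1][0] * a + gamma[1][1] * b for a, b in zip(x1, x2)]

original = predict(crit, x1, x2)
transformed = predict(crit, y1, y2)
max_gap = max(abs(a - b) for a, b in zip(original, transformed))
print("determinant:", det_gamma, "largest difference in predictions:", max_gap)
```

Both sets of predictors span the same plane, so the projection of the criterion onto that plane, and hence every predicted value, is the same.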



For the above reasons, we propose to replace Eq. [1] by an
equivalent multiple regression equation using a pair of uncorrelated,
unit-norm vectors $\{c_1, c_2\}$ (mathematically known as an orthonormal base
of the space spanned by $x_1$ and $x_2$) as the predictors, i.e.,

    [2]    $\widehat{t_2 - t_1} = R(t_2 - t_1 | c_1, c_2),$

where the exact nature of $c_1$ and $c_2$ (i.e., how they are derived from
$x_1$ and $x_2$) is to be specified later. Since $c_1$ and $c_2$ are uncorrelated and
have unit norms (i.e., the standard deviations of $C_1$ and $C_2$ are unity),
Eq. [2] may further be rewritten, successively, as

    [3]    $\widehat{t_2 - t_1} = \sigma(t_2 - t_1, c_1)\,c_1 + \sigma(t_2 - t_1, c_2)\,c_2$

           $= [\sigma(t_2,c_1) - \sigma(t_1,c_1)]\,c_1 + [\sigma(t_2,c_2) - \sigma(t_1,c_2)]\,c_2 .$

[The first step follows from the facts that, when the predictors are
uncorrelated, the partial regression coefficients are the same as the
simple regression coefficients, and that $c_1$ and $c_2$ are of unit norm
(see Section 2). The second step follows from the fact that the covariance
of the difference between two variables with a third equals the difference
between their respective covariances with the third variable:
Cov(A-B,C) = Cov(A,C) - Cov(B,C).]
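The two steps used in Eq. [3] can be verified on a toy example. This is a minimal Python sketch with hypothetical vectors standing in for $t_1$, $t_2$, $c_1$ and $c_2$; the function `solve_normal_equations` is an illustrative helper, not part of the report.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_normal_equations(y, p1, p2):
    """Least-squares weights of y on two predictors via the 2x2 normal equations."""
    a11, a12, a22 = dot(p1, p1), dot(p1, p2), dot(p2, p2)
    b1, b2 = dot(y, p1), dot(y, p2)
    det = a11 * a22 - a12 * a12
    return (b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det

# A hypothetical orthonormal pair (both vectors sum to zero, as deviation vectors do).
c1 = [v / math.sqrt(2) for v in (1, -1, 0, 0)]
c2 = [v / math.sqrt(6) for v in (1, 1, -2, 0)]

# Hypothetical stand-ins for the true-score vectors t1 and t2.
t1 = [0.5, -1.0, 0.2, 0.3]
t2 = [1.5, -0.5, -1.2, 0.2]
diff = [b - a for a, b in zip(t1, t2)]

w1, w2 = solve_normal_equations(diff, c1, c2)

# Step 1: with orthonormal predictors the weights are plain scalar products.
# Step 2: (t2 - t1, c) = (t2, c) - (t1, c).
print(w1, dot(diff, c1), dot(t2, c1) - dot(t1, c1))
print(w2, dot(diff, c2), dot(t2, c2) - dot(t1, c2))
```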

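From the last member of Eq. [3] it is apparent that, in order
to be able to use Eq. [2] in practice, we must know (i.e., be able to
calculate)

    $\sigma(t_1,c_1)$, $\sigma(t_1,c_2)$, $\sigma(t_2,c_1)$ and $\sigma(t_2,c_2)$.

Recalling that $c_1$ and $c_2$ are to be defined as linear combinations of
$x_1$ and $x_2$, i.e.,

    $c_i = \alpha_{i1} x_1 + \alpha_{i2} x_2 \quad (i = 1, 2),$

it follows that

    $\sigma(t_j, c_i) = \sigma(t_j, \alpha_{i1} x_1 + \alpha_{i2} x_2)$

    $= \sigma(t_j, \alpha_{i1} x_1) + \sigma(t_j, \alpha_{i2} x_2)$

    $= \alpha_{i1}\,\sigma(t_j, x_1) + \alpha_{i2}\,\sigma(t_j, x_2) \quad (i = 1, 2;\ j = 1, 2).$

Therefore, to use Eq. [2] we must know

    $\sigma(t_1,x_1)$, $\sigma(t_1,x_2)$, $\sigma(t_2,x_1)$ and $\sigma(t_2,x_2)$.

Of these, however, we already know the like-subscripted covariances,
$\sigma(t_1,x_1)$ and $\sigma(t_2,x_2)$; i.e.,

    [4]    $\sigma(t_1,x_1) = \|x_1\|^2 \rho_1$ and $\sigma(t_2,x_2) = \|x_2\|^2 \rho_2 ,$

where $\rho_1$ and $\rho_2$ are the reliabilities of the pretest $X_1$ and posttest
$X_2$, respectively.^1

^1 Each of Eqs. [4] may be derived as follows:

    $\rho_{xt} = \rho(x,t) = \frac{\sigma(x,t)}{\sigma_x \sigma_t}$, so $\sigma(x,t) = \sigma_x \sigma_t \rho_{xt}$.

But $\rho_{xt} = \frac{\sigma_t}{\sigma_x} = \sqrt{\rho_{xx}}$, so $\sigma_t = \sigma_x \sqrt{\rho_{xx}}$, and

    $\sigma(x,t) = \sigma_x (\sigma_x \sqrt{\rho_{xx}}) \sqrt{\rho_{xx}} = \sigma_x^2 \rho_{xx} .$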



Hence, we need only show how to find the cross-subscripted covariances,
$\sigma(t_1,x_2)$ and $\sigma(t_2,x_1)$.

It turns out that these cannot be determined exactly, but their
upper and lower bounds can be computed. Toward this end, we first
discuss some mathematical preliminaries.

BOUNDS FOR $(t_i, x_j)$ WHERE $i \neq j$

A powerful mathematical tool for obtaining bounds on scalar
products of the sort we are interested in is Bessel's Inequality:

    Given an orthonormal set $\{a_1, a_2, \ldots, a_\nu\}$ (i.e., a
    set of mutually orthogonal vectors all of unit norm)
    and any vector y, it is true that

    [5]    $\sum_{i=1}^{\nu} (y, a_i)^2 \le \|y\|^2 .$

It may be noted that, in any finite dimensional space, this inequality
follows readily from the Pythagorean theorem. The equal sign holds
when $\nu$ is the dimensionality of the space in which y lies (i.e., when
$\{a_1, a_2, \ldots, a_\nu\}$ is a complete orthonormal set, or an orthonormal base
of the space), for the sum on the left is then the sum of the squares
of the components of y along all of the orthogonal axes. If $\nu$ is less
than the dimensionality of the space, the left-hand sum will lack the
squares of some of the components of y, and hence the "less than" sign
may hold. (We cannot say that the "less than" sign necessarily holds,
because the components whose squares are missing may happen to be zero
anyway.) The reason why inequality [5] is given a celebrated name is
that Bessel proved it to hold even for a vector space of infinite
dimensionality (i.e., a Hilbert space), in which case $\nu$ itself may be
infinite and yet $\{a_1, a_2, \ldots\}$ may fail to be a complete orthonormal
set.
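Bessel's inequality in the finite-dimensional case can be seen at work in a three-dimensional toy example. This is a sketch in Python with an arbitrary vector y and an orthonormal set chosen purely for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

r = 1.0 / math.sqrt(2.0)
a1 = [r, r, 0.0]       # three mutually orthogonal unit vectors in 3-space
a2 = [r, -r, 0.0]
a3 = [0.0, 0.0, 1.0]

y = [3.0, 4.0, 12.0]
norm_sq = dot(y, y)                                  # ||y||^2 = 169

# Incomplete orthonormal set {a1, a2}: the "less than" sign holds here.
partial = sum(dot(y, a) ** 2 for a in (a1, a2))

# Complete orthonormal set {a1, a2, a3}: equality holds.
full = sum(dot(y, a) ** 2 for a in (a1, a2, a3))

print(partial, "<=", norm_sq)
print(full, "==", norm_sq)
```

With the incomplete set the sum omits the squared component of y along a3 (144), which is exactly the shortfall between 25 and 169.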



For our particular application, we choose the orthonormal set
$\{a_1, a_2, \ldots, a_\nu\}$ as follows: Let $x_1^{(1)}, x_1^{(2)}, \ldots, x_1^{(N-2)}$ be the
observed-score vectors of N - 2 parallel tests of $X_1$, and $e_1^{(1)}, e_1^{(2)}, \ldots,
e_1^{(N-2)}$ be the corresponding error-score vectors. Then, since the error
components of any two parallel tests are by definition uncorrelated, it
follows that

    $\left\{ \frac{e_1^{(0)}}{\|e_1^{(0)}\|}, \frac{e_1^{(1)}}{\|e_1^{(1)}\|}, \frac{e_1^{(2)}}{\|e_1^{(2)}\|}, \ldots, \frac{e_1^{(N-2)}}{\|e_1^{(N-2)}\|} \right\}$

is an orthonormal set comprising N - 1 vectors (one less than the total
dimensionality, N, of our space). Here $e_1^{(0)}$ is the error-score vector
of $X_1$ itself, the superscript "(0)" being added for consistency of
notation.

Using this particular orthonormal set as the $\{a_1, a_2, \ldots,
a_\nu\}$ in Bessel's inequality [5], we get

    [6]    $\sum_{i=0}^{N-2} \left( y, \frac{e_1^{(i)}}{\|e_1^{(i)}\|} \right)^2 \le \|y\|^2 .$

Now, from the definition of reliability, we know that

    $\|e_1^{(i)}\|^2 = \|x_1\|^2 (1 - \rho_1)$

for all i = 0, 1, 2, ..., N - 2. Therefore [6] becomes

    $\sum_{i=0}^{N-2} \frac{(y, e_1^{(i)})^2}{\|x_1\|^2 (1 - \rho_1)} \le \|y\|^2$

or, upon factoring out $1 / [\|x_1\|^2 (1 - \rho_1)]$ from the summation on the left and
dividing by it on both sides,

    [7]    $\sum_{i=0}^{N-2} (y, e_1^{(i)})^2 \le \|y\|^2\, \|x_1\|^2 (1 - \rho_1) .$

This relation, as it stands, is clearly intractable. We
therefore introduce a simplifying assumption: that the error component
of each of several parallel tests has the same covariance with the
error component of a given external test, or the assumption of "homogeneity
of error covariances" for parallel measures with another test,
for brevity. Symbolically, we assume

    [8]    $\sigma(e_y, e_1^{(0)}) = \sigma(e_y, e_1^{(1)}) = \ldots = \sigma(e_y, e_1^{(N-2)}) \equiv \sigma(e_y, e_1)$, say.

This assumption is not as far-fetched as it may seem at first glance,
for it merely requires that the observed-score covariances between Y
and each of $X_1^{(0)}, X_1^{(1)}, \ldots, X_1^{(N-2)}$ are all equal.^2 Furthermore,
$\sigma(y, x_1) = \sigma(y, x_1') = \ldots$, together with the assumption that
$\sigma_{x_1} = \sigma_{x_1'} = \ldots$ (since $X_1, X_1', \ldots$ are parallel measures), implies
and is implied by $\rho(y, x_1) = \rho(y, x_1') = \ldots$. Thus, the homogeneity of
error covariances assumption [8] is seen to be equivalent to assuming that
all members of a set of parallel tests correlate equally with a given
external test, which seems to be a reasonable assumption.

^2 This may be seen as follows:

    $\sigma(y, x_1) - \sigma(y, x_1') = \sigma(y, e_1) - \sigma(y, e_1')$

[because any observed score is, by definition, equal to the sum of the
true score and the error score, and since $x_1$ and $x_1'$ have the same true
score component]

    $= \sigma(e_y, e_1) - \sigma(e_y, e_1')$

[since $\sigma(t_y, e_1) = \sigma(t_y, e_1') = 0$].

It should be noted that [8] represents a liberalization of
the traditional assumption in classical test theory, in that [8] merely
states that the N - 1 error covariances are equal while the traditional
assumption requires that these covariances all be equal to zero (the
"universally uncorrelated measurement errors" assumption). In other
words, the traditional assumption is a special case of [8], with
$\sigma(e_y, e_1) = 0$.

When we introduce Eqs. [8] into inequality [7], the summands
on the left all become equal, and the sum reduces to $(N-1)(y, e_1)^2$.
Hence, inequality [7] reduces to

    [9]    $(y, e_1)^2 \le \|y\|^2\, \|x_1\|^2\, \frac{1 - \rho_1}{N - 1} .$

Note, incidentally, that this implies that if $\rho_1 = 1$ or $N \to \infty$,
$(y, e_1) = 0$, in agreement with the traditional assumption. It is clear,
however, that the "homogeneity of error covariances" assumption [8] is
incompatible with letting $N \to \infty$, for then the infinite series on the
left-hand side of inequality [7] must diverge (since it is the sum of
an infinite number of constant positive terms) and cannot be bounded.
We therefore exclude the possibility that $N \to \infty$, and conclude that the
only condition under which [9] leads to the classical assumption,
$(y, e_1) = 0$, is when $\rho_1 = 1$. That is, within the realm of perfectly
reliable tests, the error components of any two tests are always
uncorrelated, which is trivially true since the error scores are
constantly equal to zero anyway.

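Next, from the definition

    $x_1 = t_1 + e_1$

it follows that

    $e_1 = x_1 - t_1$

and hence that

    $(y, e_1) = (y, x_1) - (y, t_1) .$

Substituting this in [9], we get

    $[(y, x_1) - (y, t_1)]^2 \le \|y\|^2\, \|x_1\|^2\, \frac{1 - \rho_1}{N - 1}$

or

    $-\|y\| \cdot \|x_1\| \sqrt{\frac{1 - \rho_1}{N - 1}} \le (y, x_1) - (y, t_1) \le \|y\| \cdot \|x_1\| \sqrt{\frac{1 - \rho_1}{N - 1}} ,$

whence

    [10]    $(y, x_1) - \|y\| \cdot \|x_1\| \sqrt{\frac{1 - \rho_1}{N - 1}} \le (y, t_1) \le (y, x_1) + \|y\| \cdot \|x_1\| \sqrt{\frac{1 - \rho_1}{N - 1}} .$

Note, again, that if $\rho_1 = 1$, this yields

    $(y, t_1) = (y, x_1),$

which is the classical test-theory result under the assumption of
uncorrelated errors of measurement for any pair of tests.

Now, recalling that y was an arbitrary test vector (other than
one of the parallel measures of $x_1$), we may let $y = x_2$, the post-test
vector. In this instance [10] becomes

    [11a]    $(x_1, x_2) - \|x_1\| \cdot \|x_2\| \sqrt{\frac{1 - \rho_1}{N - 1}} \le (t_1, x_2) \le (x_1, x_2) + \|x_1\| \cdot \|x_2\| \sqrt{\frac{1 - \rho_1}{N - 1}}$

and, similarly, by interchanging the roles of $x_1$ and $x_2$, we get

    [11b]    $(x_1, x_2) - \|x_1\| \cdot \|x_2\| \sqrt{\frac{1 - \rho_2}{N - 1}} \le (t_2, x_1) \le (x_1, x_2) + \|x_1\| \cdot \|x_2\| \sqrt{\frac{1 - \rho_2}{N - 1}} .$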

Thus, we have established upper and lower bounds for $\sigma(t_1,x_2)$
and $\sigma(t_2,x_1)$, the cross-subscripted covariances which were all that
remained to be known in order to be able to use Eq. [2] in practice. It
is true that we have not determined these covariances exactly (which
seems impossible to do in principle), and hence an exact estimate of
$t_2 - t_1$ is infeasible. However, by suitable substitutions of the upper
and lower bounds of $\sigma(t_i,x_j)$, depending on whether they appear with a
positive or negative sign in the regression equation after $c_1$ and $c_2$
have been specified, we are able to obtain upper and lower bounds for
$\widehat{t_2 - t_1}$.
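Inequalities [11a] and [11b] share one form, so a single small function suffices to evaluate them. The sketch below is in Python; the function name `cross_covariance_bounds` and its argument names are illustrative assumptions. The input figures are the reading pretest and posttest statistics used in the numerical example later in this chapter (standard deviations 10.92 and 12.41, covariance 113.02, reliability .95, N = 624).

```python
import math

def cross_covariance_bounds(cov_x1x2, sd_x1, sd_x2, reliability, n):
    """Bounds of the form [11a]/[11b] for a cross-subscripted covariance.

    reliability is rho_1 for the bounds on (t1, x2) and rho_2 for (t2, x1).
    """
    half_width = sd_x1 * sd_x2 * math.sqrt((1.0 - reliability) / (n - 1))
    return cov_x1x2 - half_width, cov_x1x2 + half_width

lower, upper = cross_covariance_bounds(113.02, 10.92, 12.41, 0.95, 624)
print(round(lower, 2), round(upper, 2))

# With perfect reliability the interval collapses to the classical value.
print(cross_covariance_bounds(113.02, 10.92, 12.41, 1.0, 624))
```

The interval narrows as the reliability approaches 1 or as N grows, which mirrors the remarks following inequality [9].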

A computer program for implementing the foregoing developments
is being written, but it could not be completed within the contract
period, mainly because it seeks to permit a larger set of predictor
variables than just $\{x_1, x_2\}$ in estimating $t_2 - t_1$. For it stands to
reason (as, indeed, Cronbach and Furby, 1970, have suggested) that the
more predictors (including demographic variables) we employ, the better
will be the accuracy with which we can estimate $t_2 - t_1$.

At this point, we can only present computed results for a
lower bound of the accuracy of the estimate $\widehat{t_2 - t_1}$, to which we address
ourselves in the next section.



ACCURACY OF ESTIMATE



The accuracy of any estimate made by multiple regression may
be gauged by the multiple correlation coefficient. In the present
context, we wish to calculate $\rho(t_2 - t_1, \widehat{t_2 - t_1})$, where $\widehat{t_2 - t_1}$ is defined
by Eq. [2]. However, since its exact value cannot be determined in
principle, we must be satisfied with finding a lower bound for
$\rho(t_2 - t_1, \widehat{t_2 - t_1})$.

It is well known that, when the predictor variables are
uncorrelated, the squared multiple R is the sum of the squares of the
zero-order correlations between the several predictors and the criterion.
For the case at hand, we have

    $\rho^2(t_2 - t_1, \widehat{t_2 - t_1}) = \rho^2(t_2 - t_1, c_1) + \rho^2(t_2 - t_1, c_2)$

or, since $c_1$ and $c_2$ are of unit norm besides being orthogonal
(uncorrelated),

    [12]    $\rho^2(t_2 - t_1, \widehat{t_2 - t_1}) = \frac{\sigma^2(t_2 - t_1, c_1)}{\|t_2 - t_1\|^2} + \frac{\sigma^2(t_2 - t_1, c_2)}{\|t_2 - t_1\|^2} .$

Here $\{c_1, c_2\}$ is an orthonormal base of the space spanned by $x_1$ and
$x_2$. It is natural to take as $c_1$ the unit vector in the direction $x_2 - x_1$
(since we are estimating $t_2 - t_1$), whereupon $c_2$ is the unit vector
orthogonal to $x_2 - x_1$ in the plane defined by $x_1$ and $x_2$. This procedure
for constructing an orthonormal base is called the Gram-Schmidt procedure
(see, e.g., Rao, 1968). The results are

    $c_1 = (x_2 - x_1) / \|x_2 - x_1\|$

    $c_2 = \{x_2 - (x_2, c_1) c_1\} / \|x_2 - (x_2, c_1) c_1\| .$
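The Gram-Schmidt construction and the additive form of the squared multiple correlation can be illustrated on small vectors. The following Python sketch uses hypothetical deviation-form vectors; `t` merely stands in for the criterion $t_2 - t_1$, which is of course not observable in practice, and all helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def scale(v, k):
    return [k * a for a in v]

def unit(v):
    return scale(v, 1.0 / math.sqrt(dot(v, v)))

# Hypothetical deviation-form vectors (each sums to zero).
x1 = [-1.0, 0.5, 1.5, -2.0, 1.0]
x2 = [-0.5, 1.0, 2.0, -3.0, 0.5]
t = [0.4, 0.3, 0.6, -1.1, -0.2]    # stand-in for the criterion t2 - t1

# Gram-Schmidt: c1 along x2 - x1, c2 orthogonal to c1 within the (x1, x2) plane.
c1 = unit(sub(x2, x1))
c2 = unit(sub(x2, scale(c1, dot(x2, c1))))

# Regression of t on the orthonormal pair, as in Eq. [3].
t_hat = [dot(t, c1) * a + dot(t, c2) * b for a, b in zip(c1, c2)]
residual = sub(t, t_hat)

# Squared multiple correlation: directly, and as the sum of two terms as in Eq. [12].
r2_direct = dot(t_hat, t_hat) / dot(t, t)
r2_terms = dot(t, c1) ** 2 / dot(t, t) + dot(t, c2) ** 2 / dot(t, t)
print(r2_direct, r2_terms)
```

Because the residual is orthogonal to both $x_1$ and $x_2$, the fitted vector is the same one that regression on the original predictors would give.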



With this special choice of $c_1$ and $c_2$ (recall that any nonsingular
linear transformation of $x_1$ and $x_2$ will leave the multiple regression,
and hence also the multiple correlation coefficient, invariant), the
two terms on the right-hand side of Eq. [12] acquire the following
interpretations:

    First term = reliability^3 of $X_2 - X_1$;

    Second term = squared correlation between $T_2 - T_1$ and the
                  residualized post-test score, partialling out $X_2 - X_1$.

^3 Because, by definition, the reliability of $X_2 - X_1$ is

    $\rho^2(x_2 - x_1, t_2 - t_1) = \frac{\sigma^2(t_2 - t_1, x_2 - x_1)}{\|t_2 - t_1\|^2\, \|x_2 - x_1\|^2} = \frac{\sigma^2(t_2 - t_1, c_1)}{\|t_2 - t_1\|^2} .$



Since $c_1$ and $c_2$ are linear combinations of $x_1$ and $x_2$, the
numerators of the fractions on the right-hand side of [12] are quadratic
functions of $\sigma(t_1,x_1)$, $\sigma(t_1,x_2)$, $\sigma(t_2,x_1)$, $\sigma(t_2,x_2)$, of which the
like-subscripted covariances are, as mentioned earlier, known exactly, and
we have obtained upper and lower bounds for the cross-subscripted
covariances as inequalities [11a] and [11b] above. Hence, lower bounds
of these numerator expressions may be calculated by substituting the
lower or upper bounds of $(t_1,x_2)$ and $(t_2,x_1)$, depending on the signs
with which they occur.

The denominator expression (common to both fractions) does
not immediately appear to be related to $(t_1,x_2)$ and $(t_2,x_1)$, but a
little algebraic manipulation reveals that it actually is related to
them. To wit,

    [13]    $\|t_2 - t_1\|^2 = (t_2 - t_1, t_2 - t_1)$

            $= \|t_2\|^2 + \|t_1\|^2 - 2(t_1, t_2),$

the first two of the three terms of the last expression being directly
observable. But

    $(t_1, t_2) = (t_1, x_2 - e_2)$

    $= (t_1, x_2)$, since $(t_1, e_2) = 0$.

Similarly,

    $(t_1, t_2) = (x_1 - e_1, t_2) = (x_1, t_2) .$

To get a lower bound for $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$, we need an upper
bound of the denominator $\|t_2 - t_1\|^2$, hence a lower bound of $(t_1, t_2)$,
for this occurs with a negative sign in expression [13] for $\|t_2 - t_1\|^2$.
Since $(t_1, t_2)$ is equivalently equal to $(t_1, x_2)$ or to $(x_1, t_2)$, as shown
above (but not equal to $(x_1, x_2)$ unless the "universally uncorrelated
measurement errors" assumption is invoked), we must use min{l.b.$(t_1,x_2)$,
l.b.$(x_1,t_2)$}, i.e., the smaller of the lower bounds of $(t_1,x_2)$ and
$(x_1,t_2)$, to replace $(t_1,t_2)$ in expression [13].
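The rule just described (take the smaller of the two lower bounds, then substitute it in expression [13]) can be written out as a short function. This Python sketch uses hypothetical function and argument names; the inputs are the pretest and posttest figures from the numerical example in this chapter, and $\|t_i\|^2$ is taken as $\rho_i \|x_i\|^2$.

```python
import math

def half_width(sd1, sd2, reliability, n):
    """Half-width of the interval in [11a] or [11b]."""
    return sd1 * sd2 * math.sqrt((1.0 - reliability) / (n - 1))

def upper_bound_true_change_norm_sq(var1, var2, sd1, sd2, rel1, rel2, cov12, n):
    """Upper bound of ||t2 - t1||^2 from expression [13]."""
    lb_t1x2 = cov12 - half_width(sd1, sd2, rel1, n)   # lower bound from [11a]
    lb_x1t2 = cov12 - half_width(sd1, sd2, rel2, n)   # lower bound from [11b]
    lb_t1t2 = min(lb_t1x2, lb_x1t2)                   # smaller of the two
    # ||t2||^2 + ||t1||^2 - 2 * (lower bound of (t1, t2))
    return rel2 * var2 + rel1 * var1 - 2.0 * lb_t1t2

ub = upper_bound_true_change_norm_sq(
    var1=119.19, var2=153.90, sd1=10.92, sd2=12.41,
    rel1=0.95, rel2=0.95, cov12=113.02, n=624)
print(round(ub, 2))
```

For these inputs the result agrees to about two decimal places with the upper bound reported for $\|t_2 - t_1\|^2$ in Table 2.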

The foregoing completes our outline of how a lower bound of
$\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$ may be computed. Details of the computation are carried
out by a computer program.^4 We now turn to a numerical example utilizing
real data. This example not only illustrates the actual calculations
for the above developments, but shows how we may introduce other
predictors besides the pre- and post-tests themselves in order to increase
the accuracy of estimating $t_2 - t_1$.

NUMERICAL EXAMPLE

The data for this example are from an unpublished study by
Misselt (1973), in which (among other things) the Metropolitan Achievement
Test battery was administered to a large group of third graders in
the Champaign, Illinois school district in the school year 1971-72.
The group was retested in 1972-73 as fourth graders. Only the Reading
test in the battery is considered below, and only the scores for 624
pupils who took the test both in 1971-72 ("pretest") and in 1972-73
("posttest") are utilized. Besides the pretest and posttest scores in
reading, IQ scores were available for these pupils, so IQ was used as a
third variable in the computations that follow.

^4 Available on request from the authors. This program accommodates
three other variables besides the pre- and post-tests themselves.

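We therefore extend Eq. [2] to

    [14]    $\widehat{t_2 - t_1} = R(t_2 - t_1 | c_1, c_2, c_3)$

            $= (t_2 - t_1, c_1) c_1 + (t_2 - t_1, c_2) c_2 + (t_2 - t_1, c_3) c_3 ,$

where $c_1$, $c_2$ and $c_3$ are constructed by the Gram-Schmidt procedure as

    $c_1 = (x_2 - x_1) / \|x_2 - x_1\|$

    $c_2 = \{x_2 - (x_2, c_1) c_1\} / \|x_2 - (x_2, c_1) c_1\|$

    $c_3 = \{x_3 - (x_3, c_1) c_1 - (x_3, c_2) c_2\} / \|x_3 - (x_3, c_1) c_1 - (x_3, c_2) c_2\| .$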

Eq. [12], for the squared multiple correlation $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$, is
accordingly generalized to

    [15]    $\rho^2(t_2 - t_1, \widehat{t_2 - t_1}) = \frac{\sigma^2(t_2 - t_1, c_1)}{\|t_2 - t_1\|^2} + \frac{\sigma^2(t_2 - t_1, c_2)}{\|t_2 - t_1\|^2} + \frac{\sigma^2(t_2 - t_1, c_3)}{\|t_2 - t_1\|^2} .$

Summary statistics for the three tests and some intermediate
results necessary for calculating $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$ when the assumption
$\rho(e_1, e_2) = 0$ is invoked, and its lower bound when this assumption is
not used, are shown in Table 1.



Table 1. Intermediate results needed for calculating $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$

                                Mean      s.d.     $\rho_i$    $\sqrt{(1 - \rho_i)/(N - 1)}$  (N = 624)
    Reading Pretest  ($X_1$)    27.82    10.92     .95        .00895
    Reading Posttest ($X_2$)    35.12    12.41     .95        .00895
    IQ               ($X_3$)   104.24    18.75

Covariance matrix for $X_1$, $X_2$, $X_3$:

    119.19    113.02    137.76
    113.02    153.90    163.67
    137.76    163.67    351.51

The covariances $(t_j, c_i)$ [j = 1, 2; i = 1, 2, 3], under the assumption that
$(e_1, e_2) = 0$:

                 $c_1$       $c_2$       $c_3$
    $t_1$        .0304    10.3876      .2391
    $t_2$      -4.8384    13.4378    -2.2270

Normalizing divisors for $c_1$, $c_2$, $c_3$:

    $K_1 = \|x_2 - x_1\| = 6.8592$
    $K_2 = \|x_2 - (x_2, c_1) c_1\| = 10.8801$
    $K_3 = \|x_3 - (x_3, c_1) c_1 - (x_3, c_2) c_2\| = 12.9970$
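The three normalizing divisors can be recomputed from the covariance matrix alone, since every scalar product of test vectors is a covariance. The following Python sketch represents each vector by its coefficients on $(x_1, x_2, x_3)$ and takes scalar products through the covariance matrix of Table 1. It is not the authors' program; the function names `inner` and `gram_schmidt` are illustrative.

```python
import math

# Covariance matrix of X1 (pretest), X2 (posttest), X3 (IQ) from Table 1.
S = [[119.19, 113.02, 137.76],
     [113.02, 153.90, 163.67],
     [137.76, 163.67, 351.51]]

def inner(u, v):
    """Scalar product of two vectors given by coefficients on (x1, x2, x3)."""
    return sum(u[i] * S[i][j] * v[j] for i in range(3) for j in range(3))

def gram_schmidt(starts):
    """Orthonormalize the starting vectors in order; also return each divisor K."""
    basis, divisors = [], []
    for v in starts:
        w = list(v)
        for c in basis:
            p = inner(v, c)                        # component of v along c
            w = [wi - p * ci for wi, ci in zip(w, c)]
        k = math.sqrt(inner(w, w))                 # norm before normalizing
        divisors.append(k)
        basis.append([wi / k for wi in w])
    return basis, divisors

# Starting vectors x2 - x1, x2, x3, as in the definitions of c1, c2, c3.
starts = [[-1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
basis, (k1, k2, k3) = gram_schmidt(starts)
print(round(k1, 4), round(k2, 4), round(k3, 4))
```

The recomputed values agree with $K_1$, $K_2$ and $K_3$ in Table 1 to about three decimal places.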







Based on the intermediate results displayed in Table 1, we
first calculate the bounds for $\sigma(t_1,x_2)$ and $\sigma(t_2,x_1)$, and compare them
with the values obtained when the assumption $\sigma(e_1, e_2) = 0$ (an instance of
the "universally uncorrelated measurement errors" of classical test theory)
is invoked.

From inequality [11a] we get

    113.02 - (10.92)(12.41)(.00895) $\le (t_1, x_2) \le$ 113.02
                                                + (10.92)(12.41)(.00895)

or

    111.81 $\le (t_1, x_2) \le$ 114.23

when the traditional assumption $\sigma(e_1, e_2) = 0$ is not invoked. Whereas

    $\sigma(t_1, x_2) = \sigma(x_1, x_2) = 113.02$

when we assume $\sigma(e_1, e_2) = 0$.

In this numerical example, since $\rho_1 = \rho_2$ (= .95), the bounds for
$\sigma(t_2,x_1)$ are exactly the same as those for $\sigma(t_1,x_2)$, as is evident by
comparing inequalities [11a] and [11b]. This will not be true in
general, when $\rho_1 \neq \rho_2$. Of course, under the classical assumption that
$\sigma(e_1, e_2) = 0$, $\sigma(t_1,x_2)$ and $\sigma(t_2,x_1)$ are always the same, both being
equal to $\sigma(x_1, x_2)$.

Before calculating the lower bound for $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$ under
the liberalized assumption of "homogeneity of error covariances" for
parallel measures, let us calculate the exact value of $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$
which the classical assumption of universally uncorrelated measurement
errors purports to enable us to get. Note that, under this assumption,
the common denominator of the fractions on the right-hand side of Eq.
[15] can be exactly computed from Eq. [13]:

    $\|t_2 - t_1\|^2 = \|t_2\|^2 + \|t_1\|^2 - 2(x_1, x_2)$

    = (153.90)(.95) + (119.19)(.95) - (2)(113.02)

    = 33.3955.
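The denominator above, and the first term of Eq. [15] (the reliability of $X_2 - X_1$) that depends on it, can be reproduced in a few lines. This is a minimal Python sketch under the classical assumption, so both cross-subscripted covariances are set equal to the observed covariance 113.02; the variable names are illustrative.

```python
# Figures from Table 1.
var1, var2, cov12 = 119.19, 153.90, 113.02
rel1 = rel2 = 0.95

# Eq. [13] under the classical assumption, where (t1, t2) = (x1, x2).
norm_sq_true_change = rel2 * var2 + rel1 * var1 - 2.0 * cov12

# ||x2 - x1||^2, the square of the normalizing divisor K1.
norm_sq_observed_change = var2 + var1 - 2.0 * cov12

# (t2 - t1, x2 - x1) = (t2,x2) - (t2,x1) - (t1,x2) + (t1,x1),
# with like-subscripted terms from Eq. [4] and cross terms set to cov12.
cov_true_observed = rel2 * var2 - cov12 - cov12 + rel1 * var1

# First term of Eq. [15]: squared correlation of t2 - t1 with c1.
first_term = cov_true_observed ** 2 / (norm_sq_observed_change * norm_sq_true_change)

print(round(norm_sq_true_change, 4), round(first_term, 4))
```

Under the classical assumption the numerator covariance coincides with $\|t_2 - t_1\|^2$, so the first term reduces to the ratio of true-change variance to observed-change variance.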



Then, using the intermediate results displayed in Table 1, we get the
following values for the three terms on the right-hand side of Eq. [15],
whose sum should equal $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$:

    First term (reliability of $X_2 - X_1$)  =  .7098
    Second term                              =  .2781
    Third term                               =  .1821
    TOTAL                                      1.1705

This result is, of course, absurd since $\rho^2(t_2 - t_1, \widehat{t_2 - t_1})$ cannot exceed
unity. This is but one instance of the various difficulties that arise
from the traditional assumption of universally uncorrelated measurement
errors. (See K. Tatsuoka, 1975, for other examples.)

We now turn to the calculation of a lower bound for
$\rho(t_2 - t_1, \widehat{t_2 - t_1})$ under the liberalized assumption of homogeneity of error
covariances for parallel measures. Table 2 shows the intermediate
results necessary for this purpose, in addition to or in lieu of the
values displayed in Table 1.

Table 2. Intermediate results needed for calculating a lower bound for
$\rho(t_2 - t_1, \widehat{t_2 - t_1})$ in the absence of the assumption $\rho(e_1, e_2) = 0$

Lower and upper bounds for $\sigma(t_j, c_i)$:

                        Lower bounds                      Upper bounds
                 $c_1$      $c_2$      $c_3$        $c_1$      $c_2$      $c_3$
    $t_1$       -.1466   10.1793     .1911         .2071   10.5960     .2871
    $t_2$      -5.0151   13.3410   -3.0491       -4.6615   13.5347   -1.4050

Lower and upper bounds for $\|t_2 - t_1\|^2$ from Eq. [13] and the bounds for
the cross-subscripted covariances:

    30.9687 $\le \|t_2 - t_1\|^2 \le$ 35.8202

Based on these intermediate results, we find the lower-bound
values of the three terms on the right-hand side of Eq. [15] to be

    First term (reliability of $X_2 - X_1$)   $\ge$ .56 [final digit illegible in source]
    Second term                               [value illegible in source]
    Third term                                [value illegible in source]

    $\rho^2(t_2 - t_1, \widehat{t_2 - t_1}) \ge .8538$

Hence, a lower bound of the multiple correlation $\rho(t_2 - t_1, \widehat{t_2 - t_1})$, a
measure of the accuracy of estimating $t_2 - t_1$ by the method proposed
in this chapter, is

    $\sqrt{.8538} = .9223 .$

SUMMARY

A vector-geometric and operator-analytic approach to derivations
and proofs in test theory, first explored by K. Tatsuoka in her
dissertation (1975), was applied in this chapter to the problem of
estimating the true change from pre- to post-tests. One advantage of
this approach is that it renders feasible hitherto intractable mathematical
developments in the absence of the traditional simplifying assumption
that error scores are universally uncorrelated.

That this assumption is inadmissible as a universal postulate
has been argued, with examples of "paradoxes" to which it leads, by
K. Tatsuoka (1975). Linn and Slinde have also pointed out, in Chapter
4 of this Report, that (especially in the case when pre- and post-tests
are under consideration) the assumption of uncorrelated errors is
unjustifiable.

Upper and lower bounds for estimated true change were developed
without the uncorrelated errors assumption, but with the less restrictive
assumption that the error covariances of a set of parallel tests with
an external variable are all equal (the "homogeneity of error covariances"
assumption). In addition, a lower bound for the multiple correlation
$\rho(t_2 - t_1, \widehat{t_2 - t_1})$ between estimated true change and actual change was
derived. It was also noted that, under the traditional uncorrelated
errors assumption, not only a lower bound, but the actual correlation
value, could be computed. When this was done for the numerical example
(using real data), however, a value exceeding unity was found, thus
providing another piece of evidence of the inadmissibility of the
universally uncorrelated errors assumption. With the relaxed assumption,
a reasonable and useful lower bound (.9223) was obtained.






REFERENCES

Cronbach, L. J., & Furby, L. How we should measure "change" - or should
we? Psychological Bulletin, 1970, 74, 68-80. See also Errata,
Ibid., 1970, 74, 218.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

Misselt, L. A. An analysis of achievement level and achievement gains
in the Champaign public schools. Unpublished paper. University
of Illinois, 1973.

Rao, C. R. Linear statistical inference and its applications. New
York: John Wiley, 1968.

Tatsuoka, K. K. Vector-geometric and Hilbert-space reformulations of
classical test theory. Unpublished doctoral dissertation. University
of Illinois at Urbana-Champaign, 1975.



APPENDIX A
COMPARABLE READING TEST SCORES: A REVIEW
OF THE ANCHOR TEST STUDY



Bianchini, J. C. & Loret, P. G. Anchor Test Study: Final Report.
Report and Volumes 1 through 30, available as ERIC Documents
ED 092 601 through ED 092 631.

Bianchini, J. C. & Loret, P. G. Anchor Test Study Supplement: Final
Report. Volumes 31 through 33, available as ERIC Documents
ED 092 632 through ED 092 634.



The prospect of reviewing the mammoth report of the Anchor Test
Study (ATS) initially struck me as an overwhelming task. With the
limited space in my office it would have been easy to refuse the
request to review the ATS had it not been for the availability of
microfiche. Although I haven't seen it in that form, hard copy of
the 34 volumes of the final report requires about 8-1/2 feet of
shelf space (Loret, 1974). An acquisition of that magnitude would
require me to part with more of those dusty "should read sometime"
items on my shelves than my conscience would allow. For better or
worse, however, modern technology which made possible the production
of the over 15,000 page report containing more than 8,000 computer
produced tables and graphs in the first place also deprived me of my
best alibi by reducing the report to a microfiche file that is only
2-3/4 inches thick.

Fortunately the task of reviewing the ATS for this journal was
greatly simplified by the fact that a very good review of the ATS
has already appeared in another NCME publication. The summer 1973
issue of Measurement in Education was devoted to a description of the
study (Jaeger, 1973). Jaeger's description appeared more than a year
before the full report was released and before the supplement study
involving an eighth test was available. In addition to having
directed the development of study specification, he had available at
that time all but the three volumes that comprise the supplement
report. Indeed the 31-volume final report of the original study
was delivered to USOE in December, 1972. The delay of almost two
years between delivery of the report and its release is unfortunate
because the value of norms certainly does not improve with age.

Jaeger's description of the ATS provides a good review of the
history of the study, the planning and conduct of the study as well
as the major outcomes of the study. A more recent overview of the
study has been provided by the project director, Peter Loret (1974).
Due to the availability of these two descriptions of the study I
will try to keep my comments about the history and study procedures
relatively brief.



OBJECTIVES AND BACKGROUND OF THE STUDY



"The Anchor Test Study had two major objectives: to provide a
method by which one may translate a child's score on any one of
seven widely used standardized reading tests into a score on any of
the other tests, and to provide new nationally representative norms
for each of these seven tests" (ATS, Final Report, Project Report,
p. 1). This was subsequently expanded to eight tests but otherwise
this concise statement of objectives needs no revision. Certainly
there were other lesser objectives such as the empirical investigations
of different equating techniques, and obtaining intercorrelations
among the various tests, but these are minor in comparison to
the two major objectives.



As noted by Jaeger (1973) and by Loret (1974) the concerns that
led to the ATS have a long history. Dual concerns about the adequacy
of national norms provided by test publishers and the desirability of
being able to compare scores obtained on one test with those obtained
on another have been with us for a long time (see for example Cureton,
1941; Lennon, 1964b).

The differences in sampling procedures that have been used by
different publishers were clearly documented by Lennon (1964b). Even
without differences in initial procedures, however, the relatively
low rate of cooperation among selected schools that is enjoyed by
publishers would make the representativeness of the norms questionable.
The lack of representativeness and comparability creates difficulties
when schools or school systems change from one battery to another or
when an attempt is made to interpret scores of transfer students.
Such difficulties, however, were not sufficient to motivate a major
norming and equating study across several publishers.

There are many technical and political obstacles to equating
tests across publishers (see Angoff, 1964; Flanagan, 1964; Lennon,
1964a; Lindquist, 1964). A strong motivation was needed to attempt
to overcome these obstacles. This motivation was provided by the
increasing demand for evaluations at the state and national level
that occurred during the latter part of the 1960's. Early attempts
to obtain achievement test data for the national evaluation of
Title I, for example, were faced with a hodgepodge of different
tests with different norms and different scales (Loret, 1974).

A major technical problem in equating tests of different pub-
lishers is that the tests may not measure the same characteristic.
Angoff (1971) lists two requirements for equating, the first of
which is that the "instruments in question must measure the same
characteristic" (p. 573). With different content specifications
used by different publishers, the satisfaction of this requirement
seemed dubious. Intercorrelations among the tests obtained in a
pilot study were found to be high enough, however, to make the
equating seem worthwhile (Jaeger, 1973).


METHODOLOGY 

The study was designed with two major phases: the norming phase
and the equating phase. The norming was designed to provide national
norms for individual pupils and for school means. The norms were
developed for the vocabulary and the reading comprehension subtests
as well as total reading for the Metropolitan Achievement Test, 1970
edition (MAT). The data were collected in April 1972 at grades 4,
5 and 6 and hence provide spring norms at those grade levels.

The sampling design for the norming study was developed by Westat
Research, Inc. The design called for a stratified random sample of
940 schools. The norms needed to be as representative of the nation's
4th, 5th and 6th-grade students as possible and great care and effort
were devoted to the design of the sample. Primary-sample schools were
selected, and for each school in the primary sample five schools with
the same sampling characteristics were randomly selected as secondary-
sample schools, to use in place of non-participants in the primary
sample. Due to careful planning and advance work with the Council of
Chief State School Officers and others, relatively little reliance had
to be placed on the secondary-sample schools (838 primary-sample and
80 secondary-sample schools with a total of approximately 65,000 pupils
actually participating in the study). The high participation rate is
a real tribute to the many people involved in the planning and conduct
of the study. It also greatly enhances the value of the norms by mini-
mizing the bias due to non-cooperation and is undoubtedly the single
most important distinction of the study norms in comparison to the
publishers' norms.

The equating phase of the study was designed to provide raw score
equivalences for total reading, the vocabulary subtest and the reading
comprehension subtest of seven major test batteries. Subsequently an
eighth test was equated to the original seven in a study conducted in
the spring of 1973. The tests, forms and the levels used at each
grade level are summarized in Table 1. By equating each of the other
tests to the MAT (the anchor test) the norms obtained for the MAT were
translated to norms for each of the other tests.



The sample characteristics for the equating phase are less crucial
than in the norming phase of the study but again this phase of the
study achieved a very high participation rate. Usable equating data
were obtained in April 1972 for a total of almost 135,000 students
for the original seven tests. To equate the GMT to the anchor test
and through it to the other six tests, usable data were obtained for
another 14,400 students in April 1973.

[Table 1: tests, forms and levels used at each grade level; not
legible in this copy.]

The design of the administration of tests in the equating phase
called for a sample of students to take each pair of tests in order
AB and a sample in order BA. A schematic representation of the
equating design is shown in Table 2. As can be seen in Table 2, in
addition to the pairing of each test with every other test in both
possible orders, each test was also paired with its own alternate
form in both an AB and a BA order. This provided for parallel-form
reliability estimates for each test.

                              Table A-2

          Schematic Representation of Equating Study Design

               Test Administration Order (April 1972)

Test         1       2       3       4       5       6       7

1. CAT      1-1*    1-2     1-3     1-4     1-5     1-6     1-7
            1*-1

2. CTBS     2-1     2-2*    2-3     2-4     2-5     2-6     2-7
                    2*-2

3. ITBS     3-1     3-2     3-3*    3-4     3-5     3-6     3-7
                            3*-3

4. MAT      4-1     4-2     4-3     4-4*    4-5     4-6     4-7
                                    4*-4

5. STEP     5-1     5-2     5-3     5-4     5-5*    5-6     5-7
                                            5*-5

6. SRA      6-1     6-2     6-3     6-4     6-5     6-6*    6-7
                                                    6*-6

7. SAT      7-1     7-2     7-3     7-4     7-5     7-6     7-7*
                                                            7*-7

               Test Administration Order (April 1973)

Test         8       4

8. GMT      8-8*    8-4
            8*-8

4. MAT      4-8

* Indicates an alternate form of the test

Eight combinations of two equating methods (linear and equi-
percentile) and four equating procedures (involving the use of
different subsets of the data from the design shown in Table 2) were
used to equate each pair of tests. These combinations of method and
procedure were compared to each other and also evaluated in terms
of estimated errors of equating. Based on these results, the equi-
percentile method and a procedure that involves pooling all the data
for a given test for each order of administration and then averaging
the equating results were found to be most satisfactory.
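The two equating methods named above can be illustrated with a minimal Python sketch. It is not the ATS computing procedure: the function names are hypothetical, the mid-point convention for percentile ranks and the linear interpolation between score points are assumptions made for the illustration, and no smoothing or pooling over orders of administration is attempted.

```python
from bisect import bisect_left, bisect_right
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Linear method: match the means and standard deviations of X and Y."""
    z = (x - mean(scores_x)) / pstdev(scores_x)
    return mean(scores_y) + pstdev(scores_y) * z


def percentile_rank(x, scores):
    """Percent of scores below x plus half the percent exactly at x
    (mid-point convention, assumed here)."""
    ordered = sorted(scores)
    below = bisect_left(ordered, x)
    at = bisect_right(ordered, x) - below
    return 100.0 * (below + 0.5 * at) / len(ordered)


def equipercentile_equate(x, scores_x, scores_y):
    """Equipercentile method: find the Y score whose percentile rank equals
    the percentile rank of x on X, interpolating between Y score points."""
    target = percentile_rank(x, scores_x)
    points = sorted(set(scores_y))
    ranks = [percentile_rank(y, scores_y) for y in points]
    if target <= ranks[0]:
        return float(points[0])
    if target >= ranks[-1]:
        return float(points[-1])
    hi = bisect_left(ranks, target)  # first score point with rank >= target
    lo = hi - 1
    frac = (target - ranks[lo]) / (ranks[hi] - ranks[lo])
    return points[lo] + frac * (points[hi] - points[lo])
```

When the two score distributions differ only in mean and standard deviation the two methods agree; they diverge when the distributions differ in shape, which is one reason for comparing them empirically.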

Following the equating of raw scores on all of the tests the
percentile norms for individual pupils and for school means were
obtained from the MAT norming study results. Comparisons of these
norms to the norms provided by the publishers were then provided.
Finally, the adequacy of the equating for several subgroups of stu-
dents was investigated.

THE REPORT

Despite the voluminous nature of the ATS report readers should
have relatively little difficulty in obtaining desired information
from it regardless of the level of detail that is required. The
needs of most users are amply met in a 92-page separate report
entitled "Anchor Test Study: Equivalence and Norms Tables for
Selected Reading Tests" which is available from the U.S. Government
Printing Office as stock number 1780-01312 at a cost of $1.90.
This report contains a brief description of the study and the primary
tables that resulted from the study. The tables are divided into
four major categories: equivalency tables, tables of individual
score norms, tables of school mean norms, and a table that presents
a comparison of ATS percentile ranks with the corresponding percentile
ranks from the publishers' norms.

For the reader who desires more technical detail, the two volumes
containing the "project reports" will usually suffice. These
volumes, which have the catchy titles "Anchor Test Study. Final
Report. Project Report" and "Anchor Test Study Supplement. Final
Report. Volume 31, Project Report", may be obtained from ERIC as
documents ED 092 601 and ED 092 632 respectively. These reports
contain detailed descriptions of the study methodology including the
sampling, estimation and equating procedures. They also contain a
discussion of the major results and technical evaluations of the study
results. At this level the reader may also want to skim through some
of the tables and graphs in Volumes 2 through 27 as well as those in
30, 32 and 33 to evaluate the adequacy of the summary and description
of results in the project reports. I think that a small sampling of
those tables and graphs will impress most readers with the thorough-
ness and scrupulous accuracy of reporting in the project reports.

For anyone who wants to dig beyond the project reports I can
only say that the tables and graphs are available through ERIC in
quantities that should satisfy even the most hearty of appetites.
Volumes 2 through 4* provide equating tables for the 8 combinations
of methods and procedures in addition to estimated errors of equating,
and test intercorrelations for grades 4, 5 and 6 respectively. Vol-
umes 5* through 10 provide graphs which compare the equating lines
for different procedures and for different equating methods at each
grade. Volumes 11 through 21 present subgroup equating tables (boys,
girls, 3 IQ groups, 3 racial groups, and 3 SES groups). Graphs com-
paring the subgroup equating results to each other and to those for
the total group are presented in Volumes 22 through 27. Volume 30
presents a comparison of the ATS norms with those provided by the test
publishers, and reports conditional errors of equating (i.e., the
standard deviation of observed scores on test j around the equivalent
score of test j for each value of test j'), quality control results
and information on the convergence of equating iterations.

The information in the first 30 volumes and in the project report
is all concerned with the 7 reading achievement tests that were in
the original study. (See Table 1.) The Supplement Report (Volumes 31
through 33) gives results of a study conducted a year after the
original study for the purpose of equating an eighth test (the Gates-
MacGinitie) to the original seven.

SELECTED RESULTS 

MAT Norms

The norms that were obtained for the reading test of the MAT are
probably the best national norms that have ever been obtained for a
standardized achievement test. As already noted the school cooperation
rate was exceptional. The sample design and weighting procedures were
of very high technical quality.



* Although it is unlikely to cause anyone any real difficulty, it
might be noted that the tables that belong in Volume 4 have been
inadvertently put on the Volume 5 microfiche (ED 092 606) under
the title "Equating Procedure Comparison Graphs, Grade 4".
The graphs that belong in Volume 5 are to be found on the
Volume 4 microfiche (ED 092 605) under the title "Equating
Tables, Error of Equating and Correlations, Grade 6".




Test Intercorrelations 

Despite many reservations about the equating of reading tests with
different content specifications the tests were all found to have high
intercorrelations. Generally, the correlations for each test with
each of the other tests fell little short of the correlation of that
test with its alternate form. When the parallel-forms reliability
estimates were used to obtain disattenuated correlations among the
tests, very few of the correlations fell below .95, which is often
used as an admittedly arbitrary cutoff for purposes of equating.
Averaging across order of presentation, the disattenuated correlations
for pairs of tests below .95 are listed in Table 3. All three cases
at grade 4 involve the MAT, all four at grade 5 involve the SAT and
all four at grade 6 involve the STEP. None of the disattenuated and
averaged-over-order correlations among reading tests fell below .89
and the tests with low correlations changed from one grade level to the
next. Although I agree with the judgment made by the investigators
that the correlations are sufficiently high to justify the equating
in all cases, one is left with a curiosity about the tests that are
involved in the "low" correlations at each grade.
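The disattenuated correlations discussed above follow the standard correction for attenuation, in which the observed correlation between two tests is divided by the square root of the product of their reliabilities. The Python sketch below shows that arithmetic. The function names are hypothetical, and taking a simple mean over the two orders of presentation is an assumption about how the averaging was done.

```python
from math import sqrt


def disattenuate(r_xy, r_xx, r_yy):
    """Correction for attenuation: observed correlation r_xy divided by the
    square root of the product of the two parallel-forms reliabilities."""
    return r_xy / sqrt(r_xx * r_yy)


def average_over_order(r_ab, r_ba):
    """Simple mean of the values obtained in order AB and in order BA
    (an assumed averaging rule, used only for illustration)."""
    return (r_ab + r_ba) / 2.0


def below_cutoff(r, cutoff=0.95):
    """True when a disattenuated correlation falls under the cutoff that
    the text describes as admittedly arbitrary."""
    return r < cutoff
```

For instance, an observed correlation of .81 between two tests that each have a parallel-forms reliability of .90 disattenuates to .90, which would fall below the .95 cutoff.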

In the case of the STEP test at grade 6 it may be that the "low"
correlations are attributable to the difficulty level of STEP being
somewhat out of phase with the other tests. Among the 7 tests in the
original study for which the test intercorrelations are available, STEP
is the only test that doesn't change levels during the 4th to 6th
grade interval and by the spring of grade 6 STEP is an easy test rela-
tive to the other tests. Partial support for this interpretation can
be found in Lord (1974). Despite the high intercorrelations of the
tests Lord found the 7 tests in the original study to have fairly differ-
ent patterns of relative efficiency at different percentile ranks.
STEP was the only test to have higher relative efficiency than the
MAT's at low percentile ranks but lower relative efficiency at middle
and high percentile ranks.

                              Table A-3

     Pairs of Total Reading Tests with Disattenuated Correlations
            Averaged over Order of Presentation Below .95
           (Value of correlation reported in parentheses)

    Grade 4              Grade 5              Grade 6

    MAT-CAT  (.94)       SAT-STEP (.89)       STEP-CAT  (illegible)
    MAT-ITBS (.93)       SAT-CTBS (.92)       STEP-CTBS (.91)
    MAT-SRA  (.93)       SAT-CAT  (.94)       STEP-ITBS (.92)
                         SAT-SRA  (.94)       STEP-SAT  (.93)

Error of Equating

An important aspect of the equating design was the provision that
made possible empirical estimation of the error of equating. This is
accomplished by the use of McCarthy's balanced half-sample replication
method (1966). The equating design consisted of a set of eight bal-
anced half-samples. These half-sample replications were used to com-
pute the root-mean squared deviation of the MAT equivalent scores for
each half-sample replication about the MAT equivalent scores for the
full sample. These errors of equating were computed for each of the
eight combinations of methods and procedures and provided a means of
judging the relative quality of the methods. The estimated error of
equating also provided a basis for judging the overall adequacy of the
equating for each test. For the preferred equating procedure and
method (i.e., the average of procedures 1 and 2 and the equipercentile
method) the estimated error for all tests was generally less than one
raw score point (substantially so in most cases). The only major
exception to this is for test scores in the "chance" range. Based
on these error of equating estimates, the equating would seem quite
satisfactory for most practical purposes.
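The root-mean squared deviation described above can be sketched in a few lines of Python. The sketch covers only the final summary step. It assumes the MAT equivalent scores from the full sample and from each half-sample replication have already been computed, it does not construct the balanced half-samples themselves, and the function name and data layout are hypothetical.

```python
from math import sqrt


def rms_error_of_equating(full_sample, half_samples):
    """Root-mean squared deviation of half-sample equivalents about the
    full-sample equivalents, one value per raw score point.

    full_sample  : MAT equivalent scores from the full sample, one entry
                   per raw score point on the test being equated.
    half_samples : one list per half-sample replication, each the same
                   length as full_sample.
    """
    k = len(half_samples)
    errors = []
    for i, full_value in enumerate(full_sample):
        squared = sum((rep[i] - full_value) ** 2 for rep in half_samples)
        errors.append(sqrt(squared / k))
    return errors
```

With eight replications, as in the ATS design, each entry of the result is the square root of the average of eight squared deviations at that raw score point.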

Comparison to Publishers' Norms

Once the tests were equated the norms obtained for the MAT were
used to convert equivalent raw scores on all other tests to percentile
ranks. Thus, the anchor test norms can be used to obtain nationally
representative norms for all of the tests. With norms for all tests
in hand, the next natural step was to compare the ATS norms to the
norms provided by the publisher. The maximum difference between the
ATS percentile rank (PR) of any test score and the PR of that same
score on the publisher's norms is listed in Table 4 for each test at
each grade. Also summarized in Table 4 is the typical sign of the
ATS PR minus the publisher's PR for scores above and for scores below
the median. A plus sign indicates that a given raw score would
typically have a higher PR on the ATS norms than on the publisher's
norms. In other words, a given score would appear better according
to ATS norms than publisher's norms where there is a plus sign. The
converse is true of a minus sign and a zero indicates that there is
not a consistent difference in that the PR's are essentially equal.

As can be seen in Table 4, the maximum difference is relatively
small for most tests at most grade levels. The SAT, and to a lesser
extent the GMT (grades 4 & 5) and the MAT (grade 4), are notable
exceptions to this statement. The differences for those tests are
substantial. It may be of interest to note that the GMT and the SAT
are the oldest of the eight tests. As indicated in Table 1 the SAT
and GMT used in the ATS were both 1964 editions. It should also be
noted that since the ATS was undertaken a new edition of the SAT has
been published (Harcourt Brace Jovanovich, 1973). Thus, the large
differences for the SAT are somewhat irrelevant. The other large dif-
ference (MAT grade 4) may be attributable to the fact that separate
answer sheets were used in the ATS whereas the publisher's norms at
grade 4 are based on scorable test booklets.

For use with the interpretation of individual scores most dif-
ferences between publisher's and ATS norms are not large enough to
cause problems. If someone is interested in evaluating trends for
groups of students, however, changing from publisher's norms to ATS
norms might make quite a noticeable difference. To get a better fix
on implications of changing to ATS norms for group data it would be
desirable to have a table like Table 4 showing the differences
between ATS school mean norms and publishers' school mean norms.
Not all publishers provide such norms, however.

                              Table A-4

                 Summary of Comparisons of ATS Norms
                     with Test Publishers' Norms

                   Maximum Difference             Typical Sign of ATS
                   in Percentile Rank            Minus Publisher's Rank

Test  Grade   Vocabulary  Comprehension  Total   Below Median  Above Median

CAT     4         2            3           3          ?             ?
        5         3            2           3          ?             ?
        6         4            2           3          ?             ?

CTBS    4         4            2           3          ?             ?
        5         3            3           3          ?             ?
        6         4            6           5          +             +

GMT     4         3           10           *          0             +
        5         3            8           *          0             +
        6         3            ?           *          0             0

ITBS    4         5            5           *          +             +
        5         6            7           *          +             +
        6         6            7           *          +             +

MAT     4         3            3           2          +             0
        5         3            2           3          +             0
        6         3            3           2          +             0

STEP    4         *            *           5          +             +
        5         *            *           5          +             +
        6         *            *           4          ?             ?

SRA     4         5            3           3          +             +
        5         5            2           3          +             +
        6         4            2           2          ?             ?

SAT     4         8           11           *          +             +
        5        15           12           *          +             +
        6        18           16           *          ?             ?

* Publisher's norms not provided.
? Entry not legible in this copy.

Subgroup Results

The tests were not only equated for the total sample but also
for eleven special subgroups resulting from four breakdowns of the
sample on the basis of sex, SES, IQ, and race. For the sex break-
down no major differences were found. The results for the three IQ
groups showed some differences but generally the differences were
small except in regions where the data were relatively sparse. Thus,
the total group equating tables appear satisfactory regardless of
sex or IQ level.

The results for SES and for race were less similar. There was a
consistent tendency at all grade levels for the high SES children to
score higher on the CTBS than on any of the other tests and for low
SES children to score lower on the SRA than on any other test.

Marked differences in equating lines were also found for sub-
groups formed on the basis of race. This is particularly true for
the Spanish-surnamed sub-group which tended to score consistently
lower in the top part of the score range on the ITBS and SRA than on the
other tests. The deviations for the black sub-group were not as
large as for the Spanish-surnamed sub-group. Furthermore the devia-
tions for the black sub-group were not consistent over all grades.
There is some tendency at the upper score ranges, however, for blacks
to score higher on the CTBS and SAT than on other tests at grade 4
and to score higher on the ITBS than on other tests at grades 5 and 6.

Although the sub-group equating results are undoubtedly the most
provocative of the entire study it must be noted that the "study
was not explicitly designed to yield stable equating relationships
for the minority sub-group children" (ATS. Final Report. Project
Report, p. 196). The sample size for the minority groups is extremely
small in the parts of the score range where the largest differences
were observed. Hence, the advice of the project report against using
the racial sub-group equivalency score data is probably sound. But
this is an area of concern that deserves more intensive study and such
work is currently under way (John Bianchini, personal communication).

UTILITY 

The Transfer Student

In the announcement of the ATS contained in the fall 1974 issue of
ETS Developments (ETS, 1974) a hypothetical girl named Mary is described.
Mary and her parents moved. Her "new" school uses the ITBS but her
old one used the STEP. Thanks to the ATS, Mary's new teacher can con-
vert Mary's raw score on the STEP Reading to an equivalent raw score on
the ITBS Reading. It might be added that either of these raw scores
can be interpreted in terms of the national norms provided by the ATS.

Although the above claim is true it assumes that the teacher will
(1) know about the ATS and (2) have the equivalency tables available.
Both of these assumptions seem questionable to me. A major effort
would be required to make this type of information broadly known by
teachers. One way of accomplishing the goal might be for the publishers
to do the conversion to ATS percentile ranks for the users, and indi-
cate the tests for which the percentile ranks are equivalent. With-
out such heavy use by publishers, however, I doubt that Mary's teacher
would know how to convert Mary's score even assuming that she received
raw scores rather than grade equivalents or some other standard score
for Mary.
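The conversion described for Mary's score can be pictured as a pair of table lookups through the anchor test, as in the Python sketch below. All table entries are invented for the illustration and are not values from the ATS equivalency tables, and the names are hypothetical. The published tables may be laid out differently.

```python
# Hypothetical excerpts of equivalency and norms tables.
# The numbers are invented; they are NOT taken from the ATS report.
STEP_TO_MAT = {30: 41, 31: 42, 32: 44}   # STEP raw score -> MAT raw score
MAT_TO_ITBS = {41: 35, 42: 36, 44: 38}   # MAT raw score  -> ITBS raw score
MAT_TO_PR = {41: 48, 42: 50, 44: 55}     # MAT raw score  -> percentile rank


def convert_through_anchor(raw, to_anchor, from_anchor):
    """Translate a raw score to the anchor-test scale, then from the
    anchor scale to the target scale."""
    anchor_score = to_anchor[raw]
    return from_anchor[anchor_score]


# Mary's (invented) STEP raw score of 31, expressed on the ITBS raw score
# scale and as a national percentile rank.
mary_itbs = convert_through_anchor(31, STEP_TO_MAT, MAT_TO_ITBS)
mary_pr = convert_through_anchor(31, STEP_TO_MAT, MAT_TO_PR)
```

The arithmetic is trivial once the tables are at hand, which is why the practical obstacle is access to the tables rather than the conversion itself.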


The need for publisher involvement to make the ATS results maxi-
mally useful prompted me to write to the six publishers that produce
the eight tests involved in the ATS to ask about their plans. In the
fairly limited time between my letters to publishers and the writing
of this review I received responses from four of the six publishers.
None of these four publishers plans to routinely provide ATS norms
to their users. But they all plan to make the information about the
study available by informing their sales representatives and/or
describing the study in their publications.

The limited effort on the part of publishers to make ATS norms and
equating results known may be as much as could be expected of the
publishers. It seems doubtful to me, however, that the planned level
of effort will be sufficient to get a very large segment of the test
users (including Mary's teacher) to use the ATS results.

By way of explanation of their limited plans to use the ATS
results the publishers cited several practical limitations of the
results. These limitations included: (1) the lack of data for tests
other than reading, (2) the lack of data for grades other than 4, 5
and 6, (3) the lack of data for the publisher's alternate forms, and
(4) the lack of scaled scores. All of these factors were viewed as
limiting the practical value of the ATS results for their users.

Changing Tests

Schools are sometimes slow to switch from one test to another
because of experience with one test and the comparative value of the
historical data. The ATS results make it possible to make a change
and still have the ability to compare current reading test results
to historical results in terms of the ATS norms. Again this assumes
that the knowledge of this capability is available to the school.

Measuring Change

Another use that has been suggested for the ATS data is in the
measurement of change where one publisher's test is used at time 1
and another publisher's test at time 2. Presumably this could be
done in terms of percentile ranks. This might be appropriate for
gauging the direction of change in relative standing as suggested
by Coleman and Karweit (1970) but not for estimating the magnitude
of change. There are major differences between change as measured
in terms of percentile ranks and as measured in terms of a vertically
equated scale such as grade equivalents (see for example Linn, 1974).




The ATS was not designed to vertically equate tests that change
levels from one grade to the next. It does provide some indirect
information for this purpose, however. For example, the same level
of the CAT was used at grades 4 and 5 but different levels of the MAT
were used at those grades (see Table 1). By using the CAT equiva-
lencies of the MAT it is possible to convert the MAT Elementary Level
Reading scores to equivalent Intermediate Level Reading scores. There
are a number of other tests with a constant level over grades 4
and 5 that might be used for this purpose and for the best estimate
it would be desirable to use some sort of combination of the various
estimates. For purposes of illustration, however, I selected a few
scores of the CAT at grade 4 and noted the equivalent Elementary
Level MAT scores. The same CAT scores were then used at grade 5 to
find the equivalent Intermediate Level MAT raw scores. These scores
are shown in Table 5. Finally, the publisher's norms were used to
convert the equated MAT Elementary and Intermediate raw scores to
grade equivalent scores. The resulting grade equivalent scores are
also reported in Table 5.

If the two columns of grade equivalent scores in Table 5 are com-
pared some non-trivial differences in the grade equivalents can be
observed. The largest of the differences in corresponding grade
equivalents shown in Table 5 occurs for MAT raw scores that are equiva-
lent to a CAT raw score of 60. At this level the grade equivalent
scores are 6.6 at grade 4 and 7.4 at grade 5 for a difference of
.8 grade equivalent units which would presumably be interpreted as
almost a "year's gain." Throughout the range the grade equivalents
tend to be larger at grade 5 than at grade 4.
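The comparison of the two grade equivalent columns can be reproduced with a short Python sketch. The values are the grade equivalents as they can be read from Table 5 in this copy, so individual digits may be affected by the quality of the scan. The variable names are hypothetical.

```python
# (CAT raw score, grade equivalent at grade 4, grade equivalent at grade 5),
# as read from Table 5 in this copy.
TABLE_5 = [
    (80, 9.9, 9.8),
    (70, 8.4, 8.4),
    (60, 6.6, 7.4),
    (50, 5.2, 5.5),
    (40, 3.7, 4.4),
    (30, 3.2, 3.5),
    (20, 2.3, 2.6),
    (10, 1.3, 1.4),
]

# Grade 5 minus grade 4 grade equivalent for the same CAT raw score,
# rounded to one decimal to match the precision of the table.
differences = {cat: round(ge5 - ge4, 1) for cat, ge4, ge5 in TABLE_5}

# CAT raw score at which the two grade equivalents differ the most.
largest_at = max(differences, key=lambda cat: differences[cat])

# Number of rows where the grade 5 value exceeds the grade 4 value.
rows_larger_at_grade_5 = sum(1 for d in differences.values() if d > 0)
```

Under these readings the largest difference falls at a CAT raw score of 60, and the grade 5 value is the larger one in six of the eight rows.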

The above analysis in terms of grade equivalent scores is admittedly
rather crude and does not begin to scratch the surface of the number of
possible comparisons of this type that might be made. It is not in-
tended to imply that growth should be measured in terms of grade equiva-
lent units; in fact, I have elsewhere argued to the contrary (Linn,
1974). Furthermore, the results in Table 5 may be an artifact of
the nature of grade equivalent scores and they are not the score
unit to use in equating. But the person who is interested in measur-
ing change needs some sort of common score and will usually want some-
thing besides percentile ranks. If so, some form of the publisher's
scaled scores is still the natural recourse. The above analysis sug-
gests that the results of such comparisons may be very misleading at
least if grade equivalent scores are used.

                              Table A-5

             Total Reading Equivalent Scores on the MAT
                 Elementary and Intermediate Levels

                        Equivalent MAT Raw Scores
                   and Corresponding Grade Equivalents

Level 3 CAT       Elementary Level (Gr. 4)    Intermediate Level (Gr. 5)
Scores                            Grade                       Grade
(Grades 4 & 5)    Raw Score     Equivalent    Raw Score     Equivalent

     80              94            9.9           91            9.8
     70              89            8.4           76            8.4
     60              84            6.6           63            7.4
     50              76            5.2           51            5.5
     40              63            3.7           39            4.4
     30              45            3.2           29            3.5
     20              2?            2.3           20            2.6
     10              12            1.3            8            1.4

? Digit not legible in this copy.

Aggregation of Results from Several Tests

Possibly the most significant use of the ATS may come from making
it possible for a governmental agency to aggregate reading test scores
across several tests. This is a potentially important use in that it
conceivably could greatly reduce the need for special test administra-
tions for information purposes at the state or national level. As noted
previously programs such as Title I ran into considerable difficulty in
trying to make sense out of test score data from a wide variety of
tests. State agencies have had similar problems which have led to the
use of single tests for statewide testing in some cases. Thanks to
the ATS results schools should be free to select their own reading
test from among the eight involved in the ATS while the capability
of aggregating data at the district, state or national level is still
maintained.

I would not find it surprising if aggregation is the main use
that is made of the ATS results. After all, it was the desire to have
this capability that made the ATS a reality over 30 years after
Cureton (1941) made his plea for an anchor test study.

LIMITATIONS 

In my opinion, the ATS is an extraordinarily sound study from a
technical point of view. Most of the limitations, some of which have
been implicitly noted above, come about more from the scope of the
study than from the implementation. There are three rather obvious
limitations of this nature that I would like to mention at this stage.
These are (1) test content, (2) grade levels, and (3) the absence of
vertically equated scaled scores.

Although reading would probably be most people's first choice if
a single content area is to be involved, there are obviously other
important content areas. Many would argue that even a complete achieve-
ment test battery puts the focus on much too narrow a range of educa-
tional goals. By making it possible to aggregate only for reading tests
the emphasis becomes even narrower. Although equating of tests in
other content areas may be desirable it would be unreasonable to expect
one study to do everything and the ATS is already a giant. Furthermore,
the technical feasibility of equating in other areas may be limited due
to less similarity in what is measured in content areas other than
reading, from one test battery to the next.

The choice of grades 4, 5 and 6 was partially based on high test
usage at those grades. They are a reasonable starting place but the
same problems that prompted the ATS remain unresolved at other grade
levels.

The absence of an effort to vertically equate tests that change
levels in grades 4, 5 and 6 and create a common scaled score is
regrettable from my perspective. Without doing this the test user
who wants to analyze scores across levels must revert to the publisher's
norms. As good as the publisher's norms may be, they do not live up
to the ATS standards.

I also think that the absence of a common scaled score is a missed
golden opportunity. By creating a new scaled score that is common to
all tests it might have been possible to reduce the diversity in types
of scaled scores which confuse users and, more importantly, to speed the
demise of some undesirable types of scores. In this way the ATS
might have helped achieve standard D5.2.3 of the 1974 Standards for
Educational and Psychological Tests (APA, 1974). According to
standard D5.2.3, "Interpretative scores that lend themselves to gross
misinterpretations, such as mental age or grade-equivalent scores,
should be abandoned or their use discouraged. Very Desirable" (APA,
1974, p. 23). The absence of scaled scores could be rectified
through secondary analysis of the data. The data that are required
are available.

A final limitation that I'd like to mention has to do with time
rather than scope. As noted above, one of the test batteries (the
SAT) has already been revised. This is apt to happen to several of
the others within the next 5 or 6 years. In view of this it seems
unfortunate that there was a delay of almost two years between the
completion of the final report and its release by USOE.

CONCLUDING REMARKS 

The ATS is a landmark study. It is a tribute to careful planning,
superb execution and high technical capability. The goals of obtaining
representative norms and equating several widely used reading tests
at grades 4, 5 and 6 were clearly accomplished. So, too, were the
several minor goals. The results of the study should prove to be of
considerable practical value, especially to governmental agencies that
want to aggregate scores across several tests. The data bank which
was created by the study should be valuable for a number of secondary
analyses.

Despite these major accomplishments, one need only look back at
Cureton's original plea for an anchor test study to realize that there
is a long way to go to achieve his ideal. According to Cureton, "An
ideal system of norms should be based on a specially constructed and
standardized test, and its units should be stable from year to year,
from test to test, and from early childhood to old age. They should
also be as directly meaningful as possible in terms of the existing
concepts of the population in general and the teaching population in
particular. ... The ideal anchor test should yield separate scores for
all the major intellectual factors in the school achievement complex"
(1941, pp. 291-292). We clearly have a ways to go. Given the expense
of equating tests of reading at three grade levels and the fact that
other content areas and other grade levels pose more difficulties, it
seems doubtful to me that we will achieve Cureton's goal.



REFERENCES

American Psychological Association. Standards for Educational and
     Psychological Tests. Washington, D. C., 1974.

Angoff, W. H. Equating non-parallel tests. Journal of Educational
     Measurement, 1964, 1, 11-14.

Angoff, W. H. Scales, norms and equivalent scores. In R. L. Thorndike
     (Ed.), Educational Measurement, 2nd Edition. Washington, D. C.:
     American Council on Education, 1971.

Coleman, J. S. & Karweit, N. L. Measures of School Performance. Santa
     Monica, California: Rand, R-488-RC, July 1970.

Cureton, E. E. Minimum requirements in establishing and reporting
     norms on educational tests. Harvard Educational Review, 1941,
     11, 287-300.

Educational Testing Service. ETS Developments. Princeton, New Jersey:
     Educational Testing Service, 1974, 21, No. 4.

Flanagan, J. Equating non-parallel tests. Journal of Educational
     Measurement, 1964, 1, 1-4.

Harcourt Brace Jovanovich. Stanford Achievement Test, 1973 edition.
     New York: Harcourt Brace Jovanovich, 1973.

Jaeger, R. M. The national test equating study in reading (The
     anchor test study). Measurement in Education, 1973, 4, 1-8.

Lennon, R. T. Equating non-parallel tests. Journal of Educational
     Measurement, 1964, 1, 15-18. (a)

Lennon, R. T. Norms. In Proceedings of 1963 Invitational Conference
     on Testing Problems. Princeton, New Jersey: Educational Testing
     Service, 1964. (b)

Lindquist, E. F. Equating non-parallel tests. Journal of Educational
     Measurement, 1964, 1, 5-10.

Linn, R. L. The use of standardized test scales to measure growth.
     Conference of Policy Research: Methods and Implications,
     University of Wisconsin, Madison, Wisconsin, May 1974.




Lord, F. M. Quick estimates of the relative efficiency of two tests
     as a function of ability level. Journal of Educational
     Measurement, 1974, 11, 247-254.






Loret, P. G. The anchor test study. Paper presented at the 1974
     Arizona Education Association - Education Fair, Phoenix,
     Arizona, Oct. 31-Nov. 1, 1974.

McCarthy, P. J. Replication: An Approach to the Analysis of Data
     from Complex Surveys. Washington, D. C.: National Center for
     Health Statistics, Vital and Health Statistics, Series 2,
     No. 14, 1966.




